LAM SSI RPI - overview of LAM's RPI SSI modules

NAME

       LAM SSI RPI - overview of LAM's RPI SSI modules

DESCRIPTION

       The  "kind"  for  RPI  SSI  modules is "rpi".  Specifically, the string
       "rpi" (without the quotes) should be used to specify which  RPI  should
       be used on the mpirun command line with the -ssi switch.  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Specifies to use the tcp RPI (and to launch a single copy of the
           executable "my_mpi_program" on each node).

       The "rpi" string is also used as a prefix to send parameters to
       specific RPI modules.  For example:

       mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
           Specifies  to  use  the tcp RPI, and to pass in the value of 131072
           (128K) as the short message length for TCP messages.  See each  RPI
           section  below  for  a  full  description  of  parameters  that are
           accepted by each RPI.

       LAM currently supports six different RPI SSI modules: crtcp, gm, lamd,
       tcp, sysv, usysv.

SELECTING AN RPI MODULE

       Only one RPI module may be selected per command execution.  The module
       is selected during MPI_INIT and is used for the duration of the MPI
       process.  It is erroneous to select different RPI modules for
       different processes.

       The kind for selecting an RPI is "rpi".  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Selects the tcp RPI and runs a single copy of the my_mpi_program
           executable on each node.

AVAILABLE MODULES

       As with all SSI modules, it is possible to pass parameters at run time.
       This section discusses the built-in LAM RPI modules,  as  well  as  the
       run-time parameters that they accept.

       In the discussion below, the parameters are described in terms of kind
       and name.  The kind and name may be specified as command line arguments
       to the mpirun command with the -ssi switch, or they may be set in
       environment variables of the form LAM_MPI_SSI_name=value.  Note that
       the -ssi command line switch takes precedence over any environment
       variables.
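
       As a minimal sketch of the two mechanisms, assuming a POSIX shell and
       the rpi_tcp_short parameter of the tcp RPI described below, the same
       settings can be placed in the environment or on the command line.  The
       values are illustrative only, and the mpirun commands are printed with
       echo rather than executed, so the sketch can be tried without a LAM
       installation.

```shell
# Set SSI parameters through the environment (form: LAM_MPI_SSI_name=value).
LAM_MPI_SSI_rpi=tcp
LAM_MPI_SSI_rpi_tcp_short=131072
export LAM_MPI_SSI_rpi LAM_MPI_SSI_rpi_tcp_short

# With the variables exported, no -ssi switch is needed on the command line.
env_cmd="mpirun C my_mpi_program"
echo "$env_cmd"

# The -ssi switch takes precedence over the environment, so this command
# would use 65536 rather than the exported 131072.
cli_cmd="mpirun -ssi rpi_tcp_short 65536 C my_mpi_program"
echo "$cli_cmd"
```

       To launch a job for real, run the printed command lines directly.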

       If the RPI that is selected is unable to run (e.g., attempting  to  use
       the  gm  RPI  when  gm  support  was not compiled into LAM, or if no gm
       hardware is available on the nodes), an appropriate error message  will
       be printed and execution will abort.

   crtcp RPI
       The crtcp RPI is a checkpoint/restart-able version of the tcp RPI (see
       below).  It is separate from the tcp RPI because the current
       implementation imposes a slight performance penalty to enable
       checkpointing and restarting of MPI jobs.  Its tunable parameters are
       the same as those of the tcp RPI.  This RPI probably only needs to be
       used when the ability to checkpoint and restart MPI jobs is required.

       See the LAM/MPI User's Guide for more details on the crtcp RPI as  well
       as  the  checkpoint/restart  capabilities of LAM/MPI.  The lamssi_cr(7)
       manual page also contains additional information.

   gm RPI
       The gm RPI is used with native Myrinet networks.  Please note that  the
       gm  RPI exists, but has not yet been optimized.  It gives significantly
       better performance than TCP over Myrinet networks, but has not yet been
       properly tuned and instrumented in LAM.

       That being said, there are several tunable parameters in the gm RPI:

       rpi_gm_maxport N
           If rpi_gm_port is not specified, LAM will attempt to find an open
           GM port to use for MPI communications, starting with port 1 and
           ending with the N value specified by the rpi_gm_maxport parameter.
           If unspecified, LAM will try all existing GM ports.

       rpi_gm_port N
           LAM will attempt to use gm port N for MPI communications.

       rpi_gm_tinymsglen N
           Specifies the maximum message size (in bytes) for  "tiny"  messages
           (i.e.,  messages  that  are sent entirely in one gm message).  Tiny
           messages are memcpy'ed into the header before it  is  sent  to  the
           destination,  and  memcpy'ed out of the header into the destination
           buffer on the receiver.  Hence, it is not advisable  to  make  this
           value too large.

       rpi_gm_fast 1
           Specifies to use the "fast" protocol for sending short gm messages.
           This protocol is unreliable in the presence of GM errors or
           timeouts, and it is not advised for MPI applications that do not
           make continual progress within MPI.

       rpi_gm_cr 1
           Enable checkpoint/restart  behavior  for  gm.   This  can  only  be
           enabled  if  the  gm  rpi  module was compiled with support for the
           gm_get() function, which is  disabled  by  default.   See  the  LAM
           Installation  and  User's  Guides  for  more  information  on  this
           parameter before you use it.
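
       The parameters above are passed like any other SSI parameter.  The
       following is a sketch only: the port number and tiny message length
       are hypothetical values chosen for illustration, not recommendations,
       and the command is printed rather than executed.

```shell
# Hypothetical values; choose ones that match the local GM installation.
gm_port=2
gm_tiny=1024

# Select the gm RPI, request a specific GM port, and set the tiny-message limit.
gm_cmd="mpirun -ssi rpi gm -ssi rpi_gm_port $gm_port -ssi rpi_gm_tinymsglen $gm_tiny C my_mpi_program"
echo "$gm_cmd"
```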

   lamd RPI
       The lamd RPI  uses  LAM's  "out-of-band"  communication  mechanism  for
       passing  MPI  messages.   Specifically,  MPI messages are sent from the
       user process to the local LAM daemon, then to the remote LAM daemon (if
       the  destination  process  is  on  a  different  node), and then to the
       destination process.

       While this adds latency to message passing because of  the  extra  hops
       that  each message must travel, it allows for true asynchronous message
       passing.  Since the LAM daemon is running in its own  execution  space,
       it  can  make  progress  on  message  passing regardless of the state /
       status of the user's program.  This can be an overall  net  savings  in
       performance and execution time for some classes of MPI programs.

       It  is  expected  that  this  RPI will someday become obsolete when LAM
       becomes multi-threaded and  allows  progress  to  be  made  on  message
       passing in separate threads rather than in separate processes.

       The lamd RPI has no tunable parameters.

   tcp RPI
       The tcp RPI uses pure TCP for all MPI message passing.  TCP sockets are
       opened between MPI processes and are used for all MPI traffic.

       The tcp RPI has one tunable parameter:

       rpi_tcp_short <bytes>
           Tells the tcp RPI the smallest size (in bytes) for a message to be
           considered "long".  Short messages are sent eagerly (even if the
           receiving side is not expecting them).  Long messages use a
           rendezvous protocol (i.e., a three-way handshake) such that the
           message is not actually sent until the receiver is expecting it.
           This value defaults to 64k.
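
       The threshold rule can be sketched in shell as follows.  This is an
       illustration of the rule as stated above, not LAM's implementation:
       the function name tcp_protocol is hypothetical, and the default of 64k
       is assumed to mean 65536 bytes.

```shell
# Assumed default threshold: 64k taken as 64 * 1024 bytes.
rpi_tcp_short=65536

# Print the protocol a message of the given size (in bytes) would use.
# rpi_tcp_short is the smallest size that counts as "long".
tcp_protocol() {
    if [ "$1" -lt "$rpi_tcp_short" ]; then
        echo "eager"
    else
        echo "rendezvous"
    fi
}

tcp_protocol 1024
tcp_protocol 65536
```

       Under these assumptions, a message exactly rpi_tcp_short bytes long is
       already treated as long.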

   sysv RPI
       The sysv RPI uses shared memory for communication between MPI processes
       on the same  node,  and  TCP  sockets  for  communication  between  MPI
       processes on different nodes.  System V semaphores are used to lock the
       shared memory pools.  This RPI is best used when running  multiple  MPI
       processes  on  uniprocessors  (or  oversubscribed  SMPs) because of the
       blocking / yielding nature of semaphores.

       The sysv RPI has the following tunable parameters:

       rpi_tcp_short <bytes>
           Since the  sysv  RPI  uses  parts  of  the  tcp  RPI  for  off-node
           communication,  this  parameter also has relevance to the sysv RPI.
           The meaning of this parameter is discussed in the tcp RPI  section.

       rpi_sysv_short <bytes>
           Tells the sysv RPI the smallest size (in bytes) for a message to be
           considered "long".  Short shared memory messages are sent  using  a
           small  "postbox"  protocol; long messages use a more general shared
           memory pool method.  This value defaults to 8k.

       rpi_sysv_pollyield <bool>
           If set to a nonzero number, force the use of a system call to yield
           the processor.  The system call will be yield(), sched_yield(), or
           select() (with a 1ms timeout), depending on what LAM's configure
           script finds at configuration time.  This value defaults to 1.

       rpi_sysv_shmpoolsize <bytes>
           The  size  of  the shared memory pool that is used for long message
           transfers.  It is allocated once on each node for each MPI parallel
           job.   Specifically,  if  multiple  MPI  processes  from  the  same
           parallel job are spawned on a single node, this pool will  only  be
           allocated once.

           The  configure  script will try to determine a default size for the
           pool if none is explicitly specified (you should always check  this
           to  see  if  it  is  reasonable).   Larger  values  should  improve
           performance especially when an application passes  large  messages,
           but will also increase the system resources used by each task.

       rpi_sysv_shmmaxalloc <bytes>
           To  prevent  a  single large message transfer from monopolizing the
           global pool, allocations from the pool are actually restricted to a
           maximum   of  rpi_sysv_shmmaxalloc  bytes  each.   Even  with  this
           restriction, it is possible for  the  global  pool  to  temporarily
           become  exhausted.  In  this  case, the transport will fall back to
           using the postbox area to transfer the message. Performance will be
           degraded, but the application will progress.

           The  configure  script will try to determine a default size for the
           maximum atomic transfer size if none is explicitly  specified  (you
           should  always  check  this  to  see  if it is reasonable).  Larger
           values should improve performance especially  when  an  application
           passes  large messages, but will also increase the system resources
           used by each task.
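
       As a sketch of overriding the pool settings at run time, the command
       below sets both sizes explicitly.  The byte values are assumptions
       made for illustration and are not LAM's configured defaults; the
       command is printed rather than executed.

```shell
# Illustrative sizes in bytes; these are assumptions, not LAM defaults.
shm_pool=16777216     # 16 MB pool, allocated once per node for the job
shm_maxalloc=1048576  # at most 1 MB per allocation from that pool

sysv_cmd="mpirun -ssi rpi sysv -ssi rpi_sysv_shmpoolsize $shm_pool -ssi rpi_sysv_shmmaxalloc $shm_maxalloc C my_mpi_program"
echo "$sysv_cmd"
```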

   usysv RPI
       The usysv RPI uses shared memory for communication between MPI
       processes on the same node, and TCP sockets for communication between
       MPI processes on different nodes.  Spin locks are used to lock the
       shared memory pools.  This RPI is best used when the number of MPI
       processes on a single node is less than or equal to the number of
       processors, because it allows LAM to fully occupy the processor while
       waiting for a message and never be swapped out.
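
       The guidance for choosing between the sysv and usysv RPIs can be
       sketched as a small shell helper.  The function choose_shm_rpi is
       hypothetical and simply encodes the rule of thumb given in this page;
       the process and processor counts are supplied by the caller.

```shell
# Suggest a shared-memory RPI from per-node process and processor counts.
# Usage: choose_shm_rpi <processes_per_node> <processors_per_node>
choose_shm_rpi() {
    if [ "$1" -le "$2" ]; then
        echo "usysv"   # spin locks: each process can keep its own processor
    else
        echo "sysv"    # semaphores: blocking suits oversubscribed nodes
    fi
}

# Example: 2 processes on a 4-processor node.
rpi=$(choose_shm_rpi 2 4)
echo "mpirun -ssi rpi $rpi C my_mpi_program"
```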

       The usysv RPI has many of the same tunable parameters as the sysv RPI:

       rpi_tcp_short <bytes>
           Same meaning as in the sysv RPI.

       rpi_usysv_short <bytes>
           Same meaning as rpi_sysv_short in the sysv RPI.

       rpi_usysv_pollyield <bool>
           Same meaning as rpi_sysv_pollyield in the sysv RPI.

       rpi_usysv_shmpoolsize <bytes>
           Same meaning as rpi_sysv_shmpoolsize in the sysv RPI.

       rpi_usysv_shmmaxalloc <bytes>
           Same meaning as rpi_sysv_shmmaxalloc in the sysv RPI.

       rpi_usysv_readlockpoll <iterations>
           Number of iterations to spin before yielding  the  processor  while
           waiting to read.  This value defaults to 10,000.

       rpi_usysv_writelockpoll <iterations>
           Number  of  iterations  to spin before yielding the processor while
           waiting to write.  This value defaults to 10.
