NAME

       srun_cr - run parallel jobs with checkpoint/restart support

SYNOPSIS

       srun_cr [OPTIONS...]

DESCRIPTION

       srun_cr is a wrapper around the srun command that enables batch job
       checkpoint/restart support when used with SLURM’s checkpoint/blcr
       plugin.  Its design is inspired by mpiexec_cr from MVAPICH2 and
       cr_restart from BLCR.

OPTIONS

       The srun_cr command-line options are identical to those of the srun
       command.  See "man srun" for details.

DETAILS

       After initialization, srun_cr registers a thread context callback
       function.  It then forks a child process that executes
       "cr_run --omit srun", passing along its own arguments.  cr_run is
       used to exclude the srun process from being dumped upon checkpoint.
       All catchable signals sent to srun_cr, except SIGCHLD, are forwarded
       to the child srun process.  SIGCHLD is captured so that srun_cr can
       mimic the exit status of srun when it exits.  srun_cr then loops,
       waiting for termination of the tasks launched by srun.
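
       The wrapper pattern described above (start a child, relay signals
       to it, and exit with the child's status) can be sketched as
       follows.  This is an illustration in Python, not the srun_cr
       source: the function name run_and_mirror and the short signal list
       are assumptions, and Popen.wait() stands in for the SIGCHLD
       handling.

```python
import signal
import subprocess
import sys

# Illustrative subset only; the real wrapper forwards every catchable signal.
FORWARDED = (signal.SIGINT, signal.SIGTERM, signal.SIGHUP, signal.SIGUSR1)


def run_and_mirror(argv):
    """Run argv as a child, relay signals to it, and return its exit status."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handler and remember the previous handlers.
    previous = {sig: signal.signal(sig, forward) for sig in FORWARDED}
    try:
        returncode = child.wait()
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)

    # Popen reports death by signal N as -N; shells report it as 128 + N.
    return returncode if returncode >= 0 else 128 - returncode


if __name__ == "__main__":
    # Hypothetical usage: wrap whatever command follows on the command line.
    sys.exit(run_and_mirror(sys.argv[1:] or [sys.executable, "-c", "pass"]))
```

       Mapping a signal death to 128 plus the signal number follows the
       common shell convention; whether srun_cr reports it the same way
       is not stated here.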

       The step launch logic of SLURM is augmented to check whether srun is
       running under srun_cr.  If so, the environment variable
       SLURM_SRUN_CR_SOCKET should be present; its value is the address of
       a Unix domain socket created and listened on by srun_cr.  After
       launching the tasks, srun tries to connect to the socket and sends
       the job ID, the step ID and the nodes allocated to the step to
       srun_cr.
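
       The handshake over the Unix domain socket can be illustrated with
       a small Python sketch.  Everything specific in it is an assumption
       made for the example: the placeholder variable name DEMO_CR_SOCKET,
       the one-line text message format, and the function names.  The
       actual wire format used between srun and srun_cr is not described
       on this page.

```python
import os
import socket
import tempfile
import threading

SOCKET_ENV = "DEMO_CR_SOCKET"  # placeholder name, not the variable SLURM sets


def listen_for_step(path):
    """Wrapper side: create the Unix domain socket and start listening."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server


def receive_step_info(server):
    """Wrapper side: accept one connection and parse 'job step nodes'."""
    conn, _ = server.accept()
    with conn, conn.makefile("r") as stream:
        job_id, step_id, nodes = stream.readline().split()
    return int(job_id), int(step_id), nodes


def report_step(env, job_id, step_id, nodes):
    """Launcher side: report the step only if the socket variable is set."""
    path = env.get(SOCKET_ENV)
    if path is None:
        return False  # not running under the wrapper
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(f"{job_id} {step_id} {nodes}\n".encode())
    return True


def demo():
    """Run both sides in one process and return what the wrapper received."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cr.sock")
        server = listen_for_step(path)
        env = {SOCKET_ENV: path}
        sender = threading.Thread(
            target=report_step, args=(env, 1234, 0, "node[01-04]")
        )
        sender.start()
        info = receive_step_info(server)
        sender.join()
        server.close()
        return info
```

       Passing the socket address through the environment lets the
       launcher detect the wrapper without any extra command-line option:
       when the variable is absent, the launcher simply skips the report.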

       Upon checkpoint, srun_cr checks whether the tasks have been
       launched.  If so, srun_cr first forwards the checkpoint request to
       the tasks by calling the SLURM API slurm_checkpoint_tasks() before
       dumping its own process context.
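
       The ordering at checkpoint time can be sketched as below, assuming
       the request is forwarded only when a step has actually been
       launched.  The function on_checkpoint and its two callables are
       hypothetical stand-ins: forward_to_tasks plays the role of the
       call to slurm_checkpoint_tasks(), whose real signature is not
       reproduced here, and dump_self stands for saving the wrapper's
       own context.

```python
def on_checkpoint(step, forward_to_tasks, dump_self):
    """Handle one checkpoint request.

    step is None until the launcher has reported a launched step.
    """
    if step is not None:
        # Tasks exist, so they must be checkpointed first.
        forward_to_tasks(step)
    # The wrapper's own context is always saved last.
    dump_self()


if __name__ == "__main__":
    on_checkpoint(
        (1234, 0),
        lambda step: print("checkpoint tasks of step", step),
        lambda: print("dump wrapper context"),
    )
```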

       Upon restart, srun_cr checks to see if the tasks have  been  previously
       launched   and   checkpointed.    If  true,  the  environment  variable
       SLURM_RESTART_DIR is set to the directory of the checkpoint image files
       of the tasks.  Then srun is forked and executed again.  The environment
       variable will be used by the srun command to restart execution  of  the
       tasks from the previous checkpoint.
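
       The restart path amounts to re-running the launcher with one extra
       environment variable.  A minimal Python sketch, assuming
       hypothetical helper names (restart_environment, relaunch) and an
       example directory path that is not taken from SLURM:

```python
import os
import subprocess
import sys


def restart_environment(base_env, checkpoint_dir):
    """Return a copy of base_env, adding SLURM_RESTART_DIR when restarting."""
    env = dict(base_env)
    if checkpoint_dir is not None:
        # Only set when the tasks were previously launched and checkpointed.
        env["SLURM_RESTART_DIR"] = checkpoint_dir
    return env


def relaunch(argv, base_env, checkpoint_dir):
    """Run the launcher again with the restart environment; return its status."""
    env = restart_environment(base_env, checkpoint_dir)
    return subprocess.run(argv, env=env).returncode


if __name__ == "__main__":
    # Stand-in for srun: a child that prints the variable it received.
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('SLURM_RESTART_DIR'))"]
    relaunch(child, os.environ, "/tmp/example_ckpt")
```

       Copying the environment instead of modifying os.environ keeps the
       wrapper's own environment unchanged between restarts.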

COPYING

       Copyright  (C)  2009  National University of Defense Technology, China.
       Produced at National  University  of  Defense  Technology,  China  (cf,
       DISCLAIMER).  CODE-OCEC-09-009. All rights reserved.

       This  file  is  part  of  SLURM,  a  resource  management program.  For
       details, see <https://computing.llnl.gov/linux/slurm/>.

       SLURM is free software; you can redistribute it and/or modify it  under
       the  terms  of  the GNU General Public License as published by the Free
       Software Foundation; either version 2  of  the  License,  or  (at  your
       option) any later version.

       SLURM  is  distributed  in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of  MERCHANTABILITY  or
       FITNESS  FOR  A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.

SEE ALSO

       srun(1)