NAME
srun_cr - run parallel jobs with checkpoint/restart support
SYNOPSIS
srun_cr [OPTIONS...]
DESCRIPTION
The design of srun_cr is inspired by mpiexec_cr from MVAPICH2 and
cr_restart from BLCR. It is a wrapper around the srun command that
enables batch job checkpoint/restart support when used with SLURM’s
checkpoint/blcr plugin.
OPTIONS
The srun_cr command line options are identical to those of the srun
command. See "man srun" for details.
DETAILS
After initialization, srun_cr registers a thread context callback
function. It then forks a process and executes "cr_run --omit srun"
with its arguments. cr_run is used to exclude the srun process from
being dumped upon checkpoint. All catchable signals except SIGCHLD
that are sent to srun_cr are forwarded to the child srun process.
SIGCHLD is caught so that srun_cr can mimic the exit status of srun
when it exits. srun_cr then loops, waiting for the tasks launched by
srun to terminate.
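The fork, signal forwarding and exit status handling described above
can be sketched in Python as follows. This is an illustration only, not
the srun_cr implementation: the function names are hypothetical, the
sketch is POSIX-only, and mapping a signal death to 128 plus the signal
number is the usual shell convention, assumed here rather than taken
from srun_cr.

```python
import signal
import subprocess
import sys


def build_srun_command(srun_args):
    """Prefix the srun arguments with the cr_run wrapper named in the text."""
    return ["cr_run", "--omit", "srun", *srun_args]


def forwardable_signals():
    """All known signals except the uncatchable ones and SIGCHLD."""
    skip = {signal.SIGKILL, signal.SIGSTOP, signal.SIGCHLD}
    return [s for s in signal.valid_signals() if s not in skip]


def mimic_exit_status(returncode):
    """Turn a Popen return code into a shell-style exit status.

    Assumption: a child killed by signal N is reported as 128 + N.
    """
    if returncode < 0:
        return 128 - returncode
    return returncode


def run_and_forward(argv):
    """Start argv as a child, forward signals to it, return its status."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Only forward while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    saved = {}
    for sig in forwardable_signals():
        try:
            saved[sig] = signal.signal(sig, forward)
        except (OSError, ValueError):
            # Reserved signal, or not running in the main thread.
            continue
    try:
        return mimic_exit_status(child.wait())
    finally:
        # Put the previous handlers back once the child is gone.
        for sig, old in saved.items():
            if old is not None:
                signal.signal(sig, old)
```

On a system with BLCR and SLURM installed, a wrapper built this way
would end with something like
sys.exit(run_and_forward(build_srun_command(sys.argv[1:]))).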
The step launch logic of SLURM is augmented to check whether srun is
running under srun_cr. If so, the environment variable
SLURM_SRUN_CR_SOCKET should be present; its value is the address of a
Unix domain socket created and listened on by srun_cr. After launching
the tasks, srun tries to connect to the socket and sends the job ID,
step ID and the nodes allocated to the step to srun_cr.
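The socket handshake can be illustrated with a small Python sketch.
The wire format is not described here, so the JSON message below is an
assumption made purely for illustration, as are the function names.
Only the environment variable name, taken here to be
SLURM_SRUN_CR_SOCKET, and the three reported fields come from the
description above.

```python
import json
import os
import socket
import tempfile

SOCKET_ENV = "SLURM_SRUN_CR_SOCKET"


def listen_for_step(path):
    """Wrapper side: create the Unix domain socket and listen on it."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server


def report_step(env, job_id, step_id, nodelist):
    """Launcher side: report the step if running under the wrapper.

    Returns False when the environment variable is absent, which is
    taken to mean the launcher is not running under the wrapper.
    """
    path = env.get(SOCKET_ENV)
    if path is None:
        return False
    # Hypothetical message layout; the real protocol may differ.
    payload = json.dumps(
        {"job_id": job_id, "step_id": step_id, "nodelist": nodelist}
    ).encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(payload)
    return True


def receive_step(server):
    """Wrapper side: accept one connection and decode the step info."""
    conn, _ = server.accept()
    chunks = []
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks))
```

In this sketch the wrapper would put the socket path into the
environment of the child before starting it, and later call
receive_step() to learn which step it must checkpoint.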
Upon checkpoint, srun_cr checks to see if the tasks have been launched.
If so, srun_cr first forwards the checkpoint request to the tasks by
calling the SLURM API slurm_checkpoint_tasks() before dumping its
process context.
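The ordering of the checkpoint steps can be sketched as below, assuming
the request is forwarded to the tasks only once they have been
launched. The two callables are hypothetical stand-ins: checkpoint_tasks
represents the call to slurm_checkpoint_tasks() (whose real argument
list is not reproduced here) and dump_self represents the dump of the
wrapper's own process context.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepState:
    """What the wrapper knows about the step; empty until reported."""
    job_id: Optional[int] = None
    step_id: Optional[int] = None

    @property
    def launched(self) -> bool:
        return self.job_id is not None and self.step_id is not None


def on_checkpoint(
    state: StepState,
    checkpoint_tasks: Callable[[int, int], None],
    dump_self: Callable[[], None],
) -> None:
    """Checkpoint the tasks first (if any), then the wrapper itself."""
    if state.launched:
        # Stand-in for slurm_checkpoint_tasks(); signature is assumed.
        checkpoint_tasks(state.job_id, state.step_id)
    dump_self()
```

Checkpointing the tasks before the wrapper means that a saved wrapper
image always refers to task images that already exist.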
Upon restart, srun_cr checks to see if the tasks have been previously
launched and checkpointed. If true, the environment variable
SLURM_RESTART_DIR is set to the directory of the checkpoint image files
of the tasks. Then srun is forked and executed again. The environment
variable will be used by the srun command to restart execution of the
tasks from the previous checkpoint.
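The restart decision amounts to preparing the environment for the new
srun process. A minimal Python sketch, in which the function name and
its parameters are hypothetical and only the variable name
SLURM_RESTART_DIR comes from the description above:

```python
from typing import Dict, Mapping

RESTART_DIR_ENV = "SLURM_RESTART_DIR"


def restart_environment(
    base_env: Mapping[str, str],
    tasks_checkpointed: bool,
    image_dir: str,
) -> Dict[str, str]:
    """Return the environment for the re-executed srun.

    The restart directory is exported only when the tasks were launched
    and checkpointed before; otherwise srun starts the tasks afresh.
    """
    env = dict(base_env)  # Copy so the caller's environment is untouched.
    if tasks_checkpointed:
        env[RESTART_DIR_ENV] = image_dir
    return env
```

The returned mapping would then be passed as the environment of the
newly forked srun process.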
COPYING
Copyright (C) 2009 National University of Defense Technology, China.
Produced at National University of Defense Technology, China (cf,
DISCLAIMER). CODE-OCEC-09-009. All rights reserved.
This file is part of SLURM, a resource management program. For
details, see <https://computing.llnl.gov/linux/slurm/>.
SLURM is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
SLURM is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
SEE ALSO
srun(1)