NAME

       cr_restart  -  restarts  a  process,  process  group, or session from a
       checkpoint file.

SYNOPSIS

       cr_restart [options] [checkpoint_file]

DESCRIPTION

       cr_restart restarts a process (or set of processes) from  a  checkpoint
       file created with cr_checkpoint(1).

       A restarted process has all of the attributes it had at checkpoint
       time, including its process id.  If any needed resource cannot be
       obtained for the processes in a checkpoint file (e.g., a pid is in
       use), cr_restart will fail.  If a process group or session is
       restarted, all parent/child relations, pipes, etc., between the
       processes in the checkpoint will be correctly restored.

       If the stdin/stdout/stderr of any restarted process was directed  to  a
       terminal  at  checkpoint  time,  it  is  redirected  to the controlling
       terminal of the cr_restart program.

       The current working directory of a restarted process  is  the  same  as
       when  it  was  checkpointed,  regardless  of  where the context file is
       located, or where cr_restart is invoked.

       The cr_restart process becomes the parent of the ‘eldest’ process in
       any restarted job.  This means that getppid(2) may return a different
       value to the eldest process after restart.  When the eldest restarted
       process exits (or dies from a signal), cr_restart exits with the same
       exit code (or kills itself with the same signal), so it is largely
       invisible.  It is nevertheless necessary to keep cr_restart
       ‘in-between’ your shell and the restarted processes, because most
       Unix shells get quite confused if they observe their children
       changing process ids.
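
       Because cr_restart mirrors the fate of the eldest process, a calling
       script sees that fate in $?.  The following is a minimal sketch of
       how a script might interpret the value.  It assumes the common shell
       convention that death by signal N is reported as status 128+N; the
       helper name describe_status and the file name context.1234 are
       hypothetical.  A process that deliberately exits with a code above
       128 cannot be told apart from a signal death this way.

```shell
# describe_status STATUS
# Classify an exit status as seen by the shell after cr_restart returns.
# Assumes the shell reports "killed by signal N" as 128+N.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with code $status"
    fi
}

# Typical use (context.1234 is a placeholder):
#   cr_restart context.1234
#   describe_status $?
```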

   Signals
       By  default  restarted  processes  begin  to  run  after the restart is
       complete.  Alternatively, you may specify that  they  be  stopped  (via
       --stop), or terminated/aborted/killed (via --term, --abort, or --kill).
       This is done by sending the appropriate signal to every process that is
       part  of  the  restart.   If the processes were stopped at the time the
       checkpoint was requested, then --cont may be used to  send  SIGCONT  to
       all processes after the restart is completed.

   Error handling
       By default, cr_restart blocks until the restarted process has
       completed, and exits with the same exit value as the restarted
       process (even if the restarted process died from a fatal signal).
       This can make it nearly impossible to determine whether a non-zero
       exit from cr_restart is due to a failure to restart or is the exit
       code of a correctly restarted process.  The simple approach of
       looking for ‘Restart failed:’ is not reliable.  Therefore, the
       --run-on-* family of flags is available to supply alternative (or
       supplementary) error handling.

       When any of the --run-on-* flags is passed, a hook is installed for
       the given category of failure (or success), as defined below.  When
       an error (or success) is detected and a corresponding hook is
       installed, the hook is run via the system(3) function.  If the exit
       code of the hook is non-zero, cr_restart returns this value,
       suppressing any error message that would otherwise be generated.
       If no hook is installed, if the hook is an empty string, or if the
       hook returns an exit code of zero, an explanatory error message is
       printed and an exit code related to the errno value at the time of
       failure is returned.

       --run-on-success=’cmd’
              Runs the given command as soon as the restarted process(es)  are
              known  to be running.  If the return value of ’cmd’ is non-zero,
              this also results in cr_restart terminating without  waiting  on
              termination of the restarted process(es).

       --run-on-fail-args=’cmd’
              Runs  the  given  command  if  the  arguments are invalid.  This
              includes the case in which the given context file is missing  or
              unreadable.

       --run-on-fail-temp=’cmd’
              Runs  the  given  command  if a "temporary" failure is detected.
              This includes the case of a required pid being in use.

       --run-on-fail-perm=’cmd’
              Runs the given command if a  "permanent"  failure  is  detected.
              This is most commonly due to a corrupted context file.

       --run-on-fail-env=’cmd’
              Runs   the  given  command  if  an  "environmental"  failure  is
              detected.  This includes when files required for restarting  are
              missing or inaccessible.

       --run-on-failure=’cmd’
              This  installs  the given command for all of the --run-on-fail-*
              hooks.
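
       One way to use these hooks is to have the failure hook leave a
       marker behind, so that a wrapper can tell a failed restart from a
       restarted job that exited non-zero.  The sketch below is only an
       illustration: the function name restart_checked, the variable
       restart_outcome, and the use of mktemp(1) for the marker are
       assumptions, not part of cr_restart.  The hook (touch) exits with
       zero, so cr_restart still prints its own error message and returns
       its errno-related exit code.

```shell
# restart_checked FILE
# Restart FILE and record in $restart_outcome whether the restart itself
# failed ("failed") or the restarted job ran and finished ("finished").
# The return value is whatever cr_restart returned.
restart_checked() {
    ctx=$1
    marker=$(mktemp) || return 1
    rm -f "$marker"              # only the failure hook recreates it
    cr_restart --run-on-failure="touch '$marker'" "$ctx"
    status=$?
    if [ -e "$marker" ]; then
        rm -f "$marker"
        restart_outcome=failed
    else
        restart_outcome=finished
    fi
    return "$status"
}

# Typical use (context.1234 is a placeholder):
#   restart_checked context.1234
#   echo "outcome=$restart_outcome status=$?"
```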

   File relocation
       By default, files and directories are  saved  ‘by  reference’,  storing
       their   full  pathname  in  the  context  file.   This  includes  files
       associated with a process via open(2) and/or  mmap(2)  and  directories
       associated  via opendir(3) or as the current working directory.  Use of
       --relocate oldpath=newpath  allows  remapping  of  such  paths  to  new
       locations at restart-time.

       When  parsing  the  --relocate argument the sequences ‘\=’ and ‘\\’ are
       interpreted as ‘=’ and ‘\’,  respectively,  to  allow  for  paths  that
       contain  the  ‘=’  character.   The ‘\’ character is not special in any
       other context.  (Note that command shells also have  special  treatment
       of  ‘\’  and you may therefore need quotes or additional ‘\’ characters
       to pass the argument you intend.)
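
       The escaping rule can be applied mechanically before building the
       argument.  The helper below is a hypothetical sketch (the name
       relocate_escape is not part of cr_restart); it doubles each ‘\’ and
       then prefixes each ‘=’ with ‘\’.  It assumes the same escaping is
       wanted on both sides of the separating ‘=’.

```shell
# relocate_escape PATH
# Escape '\' first and '=' second, so that the backslashes added for '='
# are not themselves doubled.
relocate_escape() {
    printf '%s\n' "$1" | sed -e 's/\\/\\\\/g' -e 's/=/\\=/g'
}

# Typical use (paths and context.1234 are placeholders):
#   old=$(relocate_escape '/data/run=1')
#   new=$(relocate_escape '/scratch/run=1')
#   cr_restart --relocate "$old=$new" context.1234
```

       Passing the result inside double quotes, as in the usage comment,
       keeps the shell from interpreting the backslashes a second time.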

       When file or  directory  associations  are  restored,  the  oldpath  is
       compared  to  the  saved  fullpath  of  each  file or directory.  If it
       matches the leading components of the path,  the  matching  portion  is
       replaced by the value of newpath.  Note that oldpath must match entire
       path components, and only leading components.  Therefore an oldpath of
       /tmp/foo will match /tmp/foo or /tmp/foo/1, but will not match
       /tmp/fooz (the component fooz is not matched in full) or
       /var/tmp/foo (the leading component /var is not matched).
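
       The matching rule can be modelled in a few lines of shell.  This is
       a sketch of the behavior described above, not the code cr_restart
       uses; the function name relocate_path is hypothetical.

```shell
# relocate_path PATH OLDPATH NEWPATH
# Print PATH with a leading OLDPATH replaced by NEWPATH, but only when
# OLDPATH matches whole leading path components.
relocate_path() {
    path=$1
    old=$2
    new=$3
    case $path in
        "$old")                       # exact match
            printf '%s\n' "$new" ;;
        "$old"/*)                     # match followed by a component
            rest=${path#"$old"}
            printf '%s\n' "$new$rest" ;;
        *)                            # no match: path is unchanged
            printf '%s\n' "$path" ;;
    esac
}

# relocate_path /tmp/foo/1   /tmp/foo /tmp/bar   prints /tmp/bar/1
# relocate_path /tmp/fooz    /tmp/foo /tmp/bar   prints /tmp/fooz
```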

       It is important to be aware that the saved fullpaths in a context file
       are the canonical paths.  Therefore the oldpath you provide  must  also
       be  a  canonical  path,  though  the  newpath  doesn’t need to be.  For
       instance, if /tmp  is  a  symbolic  link  to  /var/tmp,  then  if  your
       application  opens  the  file  /tmp/work/1234  the  path  stored in the
       context file will be /var/tmp/work/1234.  Therefore,
              --relocate /tmp/work=/tmp/play
       would not work as desired, but either of the following would:
              --relocate /var/tmp/work=/tmp/play
              --relocate /var/tmp/work=/var/tmp/play
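
       A script can compute the canonical form of oldpath rather than
       relying on the user to supply it.  The following sketch uses cd and
       pwd -P for this; the helper name canonical_dir is hypothetical.  It
       assumes the directory still exists on the restarting host and that
       its symbolic links resolve as they did at checkpoint time.

```shell
# canonical_dir DIR
# Print the physical path of DIR with all symbolic links resolved.
# Prints nothing and returns non-zero if DIR cannot be entered.
canonical_dir() {
    ( cd "$1" 2>/dev/null && pwd -P )
}

# Typical use (paths and context.1234 are placeholders):
#   old=$(canonical_dir /tmp/work) || exit 1
#   cr_restart --relocate "$old=/tmp/play" context.1234
```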

       If the --relocate option is passed multiple times, all are  applied  to
       restored  file  or  directory associations, but only the first match is
       applied to any given path.  Currently a maximum of  16  relocations  is
       supported.

   PID and related identifiers
       By default, processes are restarted with the same pid and thread id (as
       returned by getpid(2) and gettid(2), respectively).  This default
       ensures that processes and threads that signal each other and processes
       that wait on children will continue to  function  correctly.   However,
       this prevents restarting concurrent instances of the same context file.

       By default, the process group and session (as returned by getpgrp(2)
       and getsid(2)) are set to those of the cr_restart program.  This
       ensures that job control via the requester’s session leader (typically
       a login shell) will continue to function correctly.  However, this
       interferes with any job control or process group signaling that may
       take place among the restarted processes.

       There  are  options  to  individually  control whether the pid, process
       group and session are restored to their  saved  values  or  assume  new
       values  (the  process group and session inherited from cr_restart and a
       fresh pid obtained from fork(2)).  There is no separate control for the
       thread ids, as they must always follow the same policy as the pid.  The
       following describes each option, along with outlining some of the risks
       associated with the non-default ones:

       --restore-pid
              (default) This causes pid and thread ids to be restored to their
              saved values.

       --no-restore-pid
              This causes pid and thread ids to assume new values.  Any multi-
              threaded  process  has  the  possibility of using functions like
              tkill(2) which will not behave as desired if the thread ids  are
              not restored.  Similarly, any multi-process application may make
              use  of  kill(2)  or  waitpid(2),  among  others,  that  require
              restored  pids  for  correct operation.  It is also worth noting
              that many versions of glibc will cache the result  of  getpid(),
              which  may  result in calls after restore returning the original
              value, even though the pid was changed by the restart.

       --restore-pgid
              This causes the process group ids to be restored to their  saved
              values.   This  is  required for correct operation of any multi-
              process application that may perform signal or  wait  operations
              on process groups (as by passing a negative pid value to kill(2)
              or waitpid(2), among others), or which uses process  groups  for
              POSIX  job control operations.  This is NOT the default behavior
              because restoring the process group ids will prevent job control
              by the requester’s shell (or other controlling process).

       --no-restore-pgid
              (default)  This  causes  the  restarted  processes  to  join the
              process group of the cr_restart process.

       --restore-sid
              This causes the session  ids  to  be  restored  to  their  saved
              values.   This  is  required, for instance, for systems that are
              performing batch accounting based on the session id.

       --no-restore-sid
              (default) This  causes  the  restarted  processes  to  join  the
              session of the cr_restart process.

       Note  that use of --restore-pgid or --restore-sid will produce an error
       in the case that the required identifiers are in  use  in  the  system.
       This includes the possibility that they conflict with the process
       group or session of cr_restart.
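
       Since restoring pids prevents concurrent restarts of one context
       file, --no-restore-pid is the option to reach for when several
       copies are wanted.  The sketch below starts N copies in the
       background and waits for them; the function name restart_copies is
       hypothetical, and the approach is only suitable for applications
       that do not depend on their saved pids or thread ids (see the risks
       listed under --no-restore-pid).

```shell
# restart_copies FILE N
# Start N concurrent restarts of FILE, each with fresh pids, and wait
# for all of them to finish.
restart_copies() {
    ctx=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        cr_restart --no-restore-pid "$ctx" &
        i=$((i + 1))
    done
    wait
}

# Typical use (context.1234 is a placeholder):
#   restart_copies context.1234 3
```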

OPTIONS

   General options:
       -?, --help
              print this help message.

       -v, --version
              print version information.

       -q, --quiet
              suppress error/warning messages to stderr.

   Options for source location of the checkpoint:
       -d, --dir DIR
              checkpoint read from directory DIR, with one  ’context.ID’  file
              per process (unimplemented).

       -f, --file FILE
              checkpoint read from FILE.

       -F, --fd FD
              checkpoint read from an open file descriptor.

              Options  in  this group are mutually exclusive.  If no option is
              given from this group, the default is to take the final argument
              as FILE.
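
              As an illustration of --fd, the shell can open the checkpoint
              itself and hand over the descriptor.  This is a sketch only;
              the function name restart_from_fd and the choice of
              descriptor 3 are arbitrary assumptions.

```shell
# restart_from_fd FILE
# Open FILE on descriptor 3 for the duration of the command and ask
# cr_restart to read the checkpoint from that descriptor.
restart_from_fd() {
    cr_restart --fd 3 3< "$1"
}

# Typical use (context.1234 is a placeholder):
#   restart_from_fd context.1234
```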

   Options for signal sent to process(es) after restart:
       --run  no signal sent: continue execution (default).

       -S, --signal NUM
              signal NUM sent to all processes/threads.

       --stop SIGSTOP sent to all processes.

       --term SIGTERM sent to all processes.

       --abort
              SIGABRT sent to all processes.

       --kill SIGKILL sent to all processes.

       --cont SIGCONT sent to all processes.

              Options  in this group are mutually exclusive.  If more than one
              is given then only the last will be honored.

   Options for checkpoints of restarted process(es):
       --omit-maybe
              use a heuristic to omit cr_restart from checkpoints (default)

       --omit-always
              always omit cr_restart from checkpoints

       --omit-never
              never omit cr_restart from checkpoints

   Options for alternate error handling:
       --run-on-success=’cmd’
              run the given command on success

       --run-on-fail-args=’cmd’
              run the given command on invalid arguments

       --run-on-fail-temp=’cmd’
              run the given command on ’temporary’ failure

       --run-on-fail-env=’cmd’
              run the given command on ’environmental’ failure

       --run-on-fail-perm=’cmd’
              run the given command on ’permanent’ failure

       --run-on-failure=’cmd’
              run the given command on any failure

   Options for relocation:
       --relocate OLDPATH=NEWPATH
              map paths of files and directories to new  locations  by  prefix
              replacement.

   Options for restoring pid, process group and session ids:
       --restore-pid
              restore pids to saved values (default).

       --no-restore-pid
              restart with new pids.

       --restore-pgid
              restore pgid to saved values.

       --no-restore-pgid
              restart with new pgids (default).

       --restore-sid
              restore sid to saved values.

       --no-restore-sid
              restart with new sids (default).

              Options  in each restore/no-restore pair are mutually exclusive.
              If both are given then only the last will be honored.

   Options for kernel log messages (default is --kmsg-error):
       --kmsg-none
              don’t report any kernel messages.

       --kmsg-error
              on  restart  failure,  report  on  stderr  any  kernel  messages
              associated with the restart request.

       --kmsg-warning
              report on stderr any kernel messages associated with the restart
              request, regardless of success or failure.   Messages  generated
              in the absence of failure are considered to be warnings.

              Options  in this group are mutually exclusive.  If more than one
              is given then only the last will be honored.  Note that  --quiet
              suppresses all stderr output, including these messages.

AUTHORS

       Jason  Duell, Paul Hargrove, and Eric Roman, Lawrence Berkeley National
       Laboratory.

REPORTING BUGS

       Bug reports may be filed on the web at  http://mantis.lbl.gov/bugzilla.
