NAME
checkpoint - Sun Grid Engine checkpointing environment configuration
file format
DESCRIPTION
Checkpointing is a facility to save the complete status of an executing
program or job and to restore and restart from this so-called
checkpoint at a later point in time if the original program or job was
halted, e.g. through a system crash.
Sun Grid Engine provides various levels of checkpointing support (see
sge_ckpt(1)). The checkpointing environment described here is a means
to configure the different types of checkpointing in use for your Sun
Grid Engine cluster or parts thereof. For that purpose you can define
the operations which have to be executed to initiate a checkpoint
generation, a migration of a checkpoint to another host or a restart of
a checkpointed application, as well as the list of queues which are
eligible for a checkpointing method.
Supporting different operating systems may force Sun Grid Engine to
introduce operating system dependencies into the checkpointing
configuration file, and updates of the supported operating system
versions may lead to frequently changing implementation details.
Please refer to the <sge_root>/ckpt directory for more information.
Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the qconf(1)
command to manipulate checkpointing environments from the command-line
or use the corresponding qmon(1) dialogue for X-Windows based
interactive configuration.
Note that Sun Grid Engine allows backslashes (\) to be used to escape
newline (\newline) characters. The backslash and the newline are
replaced with a space (" ") character before any interpretation.
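The continuation rule above can be illustrated with a minimal Python
sketch. It is not part of Sun Grid Engine; the function name
join_continuations, the command path and the flag in the sample text are
hypothetical, and the "parameter value" line layout is assumed for
illustration only.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space.

    This mirrors the rule stated above: the substitution happens
    before the content is interpreted in any other way.
    """
    return text.replace("\\\n", " ")


# Hypothetical two-line entry; the command path and flag are made up.
raw = "ckpt_command /path/to/hypothetical_ckpt \\\n--hypothetical-flag\n"
print(join_continuations(raw), end="")
```

Because the backslash and the newline are both replaced, a continued
line that already ends in a space before the backslash ends up with two
consecutive spaces at the join.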
FORMAT
The format of a checkpoint file is defined as follows:
ckpt_name
The name of the checkpointing environment as defined for ckpt_name in
sge_types(1). It is used with the qsub(1) -ckpt switch and with the
qconf(1) options mentioned above.
interface
The type of checkpointing to be used. Currently, the following types
are valid:
hibernator
Interfaces with the Hibernator kernel level checkpointing.
cpr The SGI kernel level checkpointing is used.
cray-ckpt
The Cray kernel level checkpointing is assumed.
transparent
Sun Grid Engine assumes that the jobs submitted with reference
to this checkpointing interface use a checkpointing library such
as provided by the public domain package Condor.
userdefined
Sun Grid Engine assumes that the jobs submitted with reference
to this checkpointing interface perform their private
checkpointing method.
application-level
Uses all of the interface commands configured in the
checkpointing object, as with the kernel level checkpointing
interfaces (cpr, cray-ckpt, etc.), except for the
restart_command (see below). The restart_command is not used,
even if it is configured; the job script is invoked instead
when the job is restarted.
ckpt_command
A command-line type command string to be executed by Sun Grid Engine in
order to initiate a checkpoint.
migr_command
A command-line type command string to be executed by Sun Grid Engine
during a migration of a checkpointing job from one host to another.
restart_command
A command-line type command string to be executed by Sun Grid Engine
when restarting a previously checkpointed application.
clean_command
A command-line type command string to be executed by Sun Grid Engine in
order to clean up after a checkpointed application has finished.
ckpt_dir
A file system location in which checkpoints of potentially considerable
size should be stored.
ckpt_signal
A Unix signal to be sent to a job by Sun Grid Engine to initiate a
checkpoint generation. The value for this field can either be a
symbolic name from the list produced by the -l option of the kill(1)
command or an integer number which must be a valid signal on the
systems used for checkpointing.
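As a rough illustration of the two accepted forms, the following Python
sketch resolves either a symbolic name or an integer against the signals
known to the local host. The function name resolve_signal is
hypothetical, and accepting names both with and without the SIG prefix
is an assumption, since the output of kill -l differs between systems.
The check covers only the machine it runs on, not every system used for
checkpointing.

```python
import signal


def resolve_signal(value: str) -> signal.Signals:
    """Map a symbolic name or an integer string to a local signal.

    Raises ValueError if the value is not a signal on this host.
    """
    text = value.strip()
    if text.isdigit():
        # signal.Signals() raises ValueError for unknown numbers.
        return signal.Signals(int(text))
    name = text.upper()
    if not name.startswith("SIG"):
        # Assumption: tolerate the short form, e.g. "USR1".
        name = "SIG" + name
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None


print(resolve_signal("USR1").name)
```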
when
The points in time when checkpoints are expected to be generated.
Valid values for this parameter are composed of the letters s, m, x and
r and any combinations thereof without any separating character in
between. The same letters are allowed for the -c option of the qsub(1)
command, which overrides the definitions in the checkpointing
environment used. The meaning of the letters is defined as follows:
s A job is checkpointed, aborted and if possible migrated if the
corresponding sge_execd(8) is shut down on the job's machine.
m Checkpoints are generated periodically at the min_cpu_interval
interval defined by the queue (see queue_conf(5)) in which a job
executes.
x A job is checkpointed, aborted and if possible migrated as soon
as the job gets suspended (manually as well as automatically).
r A job will be rescheduled (not checkpointed) when the host on
which the job currently runs goes into an unknown state and the
time interval reschedule_unknown (see sge_conf(5)) defined in
the global/local cluster configuration is exceeded.
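The letter rules for the when parameter can be sketched as a small
Python check. This is an illustration only, not Sun Grid Engine code;
the names WHEN_LETTERS and parse_when are hypothetical, and rejecting an
empty value is an assumption that the text above does not settle.

```python
# Short summaries of the four letters described above.
WHEN_LETTERS = {
    "s": "checkpoint when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host goes into unknown state",
}


def parse_when(value: str) -> frozenset:
    """Return the set of letters in value, or raise ValueError.

    Letters are written without separators, so every character
    of the value must be one of s, m, x or r.
    """
    letters = frozenset(value)
    unknown = letters - set(WHEN_LETTERS)
    if not value:
        # Assumption: an empty value is treated as invalid here.
        raise ValueError("empty 'when' value")
    if unknown:
        raise ValueError(f"invalid letters: {sorted(unknown)}")
    return letters


for letter in sorted(parse_when("sx")):
    print(letter, "-", WHEN_LETTERS[letter])
```

A value such as "s,x" is rejected because the comma is not one of the
four letters, which matches the requirement that no separating character
appears between them.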
RESTRICTIONS
Note that the checkpointing, migration and restart procedures provided
by default with the Sun Grid Engine distribution, as well as the way
they are invoked in the ckpt_command, migr_command or restart_command
parameters of any default checkpointing environment, should not be
changed. If they are changed, their functionality becomes the full
responsibility of the administrator configuring the checkpointing
environment. Sun Grid Engine will just invoke these procedures and
evaluate their exit status. If the procedures do not perform their
tasks properly or are not invoked in a proper fashion, the
checkpointing mechanism may behave unexpectedly, and Sun Grid Engine
has no means to detect this.
SEE ALSO
sge_intro(1), sge_ckpt(1), sge_types(1), qconf(1), qmod(1), qsub(1),
sge_execd(8).
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.