NAME
sge_ckpt - the Sun Grid Engine checkpointing mechanism and
checkpointing support
DESCRIPTION
Sun Grid Engine supports two levels of checkpointing: the user level
and a transparent level provided by the operating system. User level
checkpointing refers to applications that do their own checkpointing
by writing restart files at certain times or algorithmic steps and by
properly processing these restart files when restarted.
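A minimal sketch of user level checkpointing in a job script follows. It
is an illustration only: the function name run_job, the variables
RESTART_FILE and TOTAL_STEPS, and the one-number restart file format are
all assumptions, not part of Sun Grid Engine.

```shell
#!/bin/sh
# Hypothetical user level checkpointing loop. All names and paths here
# are illustrative assumptions, not Sun Grid Engine conventions.
RESTART_FILE="${RESTART_FILE:-./restart.state}"
TOTAL_STEPS="${TOTAL_STEPS:-10}"

run_job() {
    step=0
    # Resume from the last recorded step if a restart file exists.
    if [ -f "$RESTART_FILE" ]; then
        step=$(cat "$RESTART_FILE")
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        # Write to a temporary file and rename it, so that an abort in
        # mid-write does not leave a truncated restart file behind.
        echo "$step" > "$RESTART_FILE.tmp"
        mv "$RESTART_FILE.tmp" "$RESTART_FILE"
    done
    echo "finished at step $step"
}
```

Calling run_job at the end of the job script starts from step 0 on the
first run and continues from the recorded step after a restart.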
Transparent checkpointing has to be provided by the operating system
and is usually integrated in the operating system kernel. An example
of a kernel-integrated checkpointing facility is the Hibernator
package from Softway for SGI IRIX platforms.
Checkpointing jobs need to be identified to the Sun Grid Engine system
by using the -ckpt option of the qsub(1) command. The argument to this
option refers to a so-called checkpointing environment, which defines
the attributes of the checkpointing method to be used (see
checkpoint(5) for details). Checkpointing environments are set up with
the qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1)
option -c can be used to override the when attribute of the referenced
checkpointing environment.
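The options above might be combined as sketched below. The environment
name my_ckpt, the script name job.sh and the helper ckpt_submit_cmd are
hypothetical. The sketch only prints the command lines, so it can be
tried without a cluster; consult qsub(1) and qconf(1) for the exact
argument forms.

```shell
#!/bin/sh
# Build the qsub command line for a checkpointing job.
# The environment and script names passed in are placeholders.
ckpt_submit_cmd() {
    ckpt_env=$1
    job_script=$2
    occasion=${3:-}
    if [ -n "$occasion" ]; then
        # -c overrides the "when" attribute of the checkpointing environment.
        echo "qsub -ckpt $ckpt_env -c $occasion $job_script"
    else
        echo "qsub -ckpt $ckpt_env $job_script"
    fi
}

# Print the commands instead of running them (dry run).
echo "qconf -sckpt my_ckpt"
ckpt_submit_cmd my_ckpt job.sh x
```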
If a queue is of the type CHECKPOINTING, jobs need to have the
checkpointing attribute flagged (see the -ckpt option to qsub(1)) to be
permitted to run in such a queue. Unlike regular batch jobs,
checkpointing jobs are aborted under conditions in which batch or
interactive jobs are suspended or even remain unaffected. These
conditions are:
o Explicit suspension of the queue or job via qmod(1) by the cluster
administration or a queue owner, if the x occasion specifier (see
qsub(1) -c and checkpoint(5)) was assigned to the job.
o A load average value exceeding the suspend threshold as configured
for the corresponding queues (see queue_conf(5)).
o Shutdown of the Sun Grid Engine execution daemon sge_execd(8)
responsible for the checkpointing job.
After being aborted, the jobs migrate to other queues unless they were
submitted to one specific queue by an explicit user request. The
migration of jobs leads to dynamic load balancing. Note: Aborting a
checkpointed job frees all resources (memory, swap space) that the job
occupies at that time. By contrast, suspended regular jobs still
occupy swap space.
RESTRICTIONS
When a job migrates to a queue on another machine, at present no files
are transferred automatically to that machine. This means that all
files used throughout the entire job, including restart files,
executables and scratch files, must be visible on that machine or
transferred explicitly (e.g. at the beginning of the job script).
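An explicit transfer at the beginning of a job script could look like
the following sketch. The helper stage_in and its arguments are
hypothetical, and it assumes that the source directory is on a file
system visible from every execution host.

```shell
#!/bin/sh
# Copy each named file from a shared directory into the local working
# directory, unless a copy is already present there.
# Usage: stage_in SHARED_DIR WORK_DIR FILE...
stage_in() {
    src_dir=$1
    work_dir=$2
    shift 2
    mkdir -p "$work_dir" || return 1
    for f in "$@"; do
        if [ ! -e "$work_dir/$f" ]; then
            # Fail early if a required file cannot be copied.
            cp -p "$src_dir/$f" "$work_dir/$f" || return 1
        fi
    done
}
```

Files that already exist in the working directory are left untouched,
so a restart file written locally is not overwritten by an older shared
copy.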
There are also some practical limitations regarding the use of disk
space by transparently checkpointing jobs. Checkpoints of a
transparently checkpointed application are usually stored in a
checkpoint file or directory by the operating system. The file or
directory contains all the text, data, and stack space for the process,
along with some additional control information. This means that jobs
which use a very large virtual address space will generate very large
checkpoint files. Also, the workstations on which the jobs will
actually execute may have little free disk space. Thus it is not always
possible to transfer a transparent checkpointing job to a machine, even
though that machine is idle. Since large virtual memory jobs must wait
for a machine that is both idle and has a sufficient amount of free
disk space, such jobs may suffer long turnaround times.
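A rough way to check whether a directory has room for a checkpoint of a
given size is sketched below. The helper has_free_kb is hypothetical and
is not something Sun Grid Engine runs itself. It relies on the POSIX
output format of df -P and assumes that the file system name contains no
spaces.

```shell
#!/bin/sh
# Succeed if the file system holding DIR has at least NEEDED_KB
# kilobytes available.
# Usage: has_free_kb DIR NEEDED_KB
has_free_kb() {
    dir=$1
    needed_kb=$2
    # With -P, the second output line carries the figures; the fourth
    # column is the available space, reported in 1024-byte units by -k.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$needed_kb" ]
}
```

For example, `has_free_kb /tmp 2097152` tests for about 2 GB of free
space in /tmp.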
SEE ALSO
sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5), Sun Grid
Engine Installation and Administration Guide, Sun Grid Engine User's
Guide
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.