

       sge_shepherd - Sun Grid Engine single job controlling agent




       sge_shepherd provides the parent process functionality for a single Sun
       Grid Engine job.  The parent functionality is necessary on UNIX systems
       to  retrieve  resource usage information (see getrusage(2)) after a job
       has finished. In addition, sge_shepherd forwards signals to the job,
       such as the signals for suspension, resumption (enabling),
       termination, and the Sun Grid Engine checkpointing signal (see
       sge_ckpt(1) for details).
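       The parent-process role described above can be sketched in Python on
       a POSIX system. This is not sge_shepherd's implementation. The
       function name run_as_parent and the set of relayed signals are
       assumptions made for illustration. The parent forks the job, relays
       selected signals to it, and calls os.wait4(), which returns the
       child's struct rusage together with its exit status.

```python
import os
import signal
import sys

# Signals this sketch relays to the child. The real set handled by
# sge_shepherd, and how it receives them from sge_execd(8), is not
# described here; this tuple is an illustrative assumption.
FORWARDED_SIGNALS = (signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2, signal.SIGCONT)


def run_as_parent(argv):
    """Start argv as a child, relay FORWARDED_SIGNALS to it while it runs,
    and return (exit_code, rusage) after it has terminated.

    exit_code follows the shell convention: 128 + N if killed by signal N.
    """
    pid = os.fork()
    if pid == 0:
        # Child: become the job. os._exit only runs if exec fails.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)

    def relay(signum, _frame):
        # Parent: pass the signal on to the job instead of acting on it.
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            pass  # the child has already been reaped

    previous = {sig: signal.signal(sig, relay) for sig in FORWARDED_SIGNALS}
    try:
        # wait4() reaps the child and returns its resource usage, which is
        # why a parent process is needed at all (cf. getrusage(2)).
        _, status, usage = os.wait4(pid, 0)
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)

    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status), usage
    return 128 + os.WTERMSIG(status), usage


if __name__ == "__main__":
    code, usage = run_as_parent([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(f"exit={code} user_cpu={usage.ru_utime:.3f}s maxrss={usage.ru_maxrss}")
```

       Relaying signals from a single parent keeps the job itself unaware of
       the batch system, while the parent stays alive to collect the usage
       data.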

       sge_shepherd receives information about the job to be started from
       sge_execd(8). While handling the job it starts up to five child
       processes in sequence. First, a prolog script is run if this feature
       is  enabled  by the prolog parameter in the cluster configuration. (See
       sge_conf(5).)  Next a parallel environment startup procedure is run  if
       the job is a parallel job. (See sge_pe(5) for more information.)  After
       that, the job  itself  is  run,  followed  by  a  parallel  environment
       shutdown  procedure  for parallel jobs, and finally an epilog script if
       requested by the epilog parameter in  the  cluster  configuration.  The
       prolog  and  epilog scripts as well as the parallel environment startup
       and shutdown procedures are to be  provided  by  the  Sun  Grid  Engine
       administrator  and  are  intended for site-specific actions to be taken
       before and after execution of the actual user job.
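       The ordering above can be expressed as a small Python sketch. The
       JobSetup class and the child_sequence function are hypothetical. They
       model only which of the five child processes would be started and in
       which order, not how sge_shepherd actually reads its job
       configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobSetup:
    """Hypothetical view of what sge_shepherd learns about one job.

    prolog/epilog stand in for the cluster configuration parameters of
    sge_conf(5); pe_start/pe_stop stand in for the parallel environment
    procedures of sge_pe(5) and are None for serial jobs.
    """
    job: str
    prolog: Optional[str] = None
    epilog: Optional[str] = None
    pe_start: Optional[str] = None
    pe_stop: Optional[str] = None


def child_sequence(setup: JobSetup) -> List[str]:
    """Return the methods to run as child processes, in execution order."""
    steps = [
        ("prolog", setup.prolog),
        ("parallel start", setup.pe_start),
        ("job", setup.job),
        ("parallel stop", setup.pe_stop),
        ("epilog", setup.epilog),
    ]
    # Skip every step that is not configured.
    return [name for name, command in steps if command]


if __name__ == "__main__":
    parallel = JobSetup(job="run.sh", prolog="pro.sh", epilog="epi.sh",
                        pe_start="start_pe.sh", pe_stop="stop_pe.sh")
    print(child_sequence(parallel))
```

       A serial job without prolog or epilog yields only ["job"]; a fully
       configured parallel job yields all five steps.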

       After the job has finished and the epilog script has been processed,
       sge_shepherd retrieves resource usage statistics about the job, places
       them in a job-specific subdirectory of the sge_execd(8) spool
       directory for reporting through sge_execd(8), and then exits.
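       A minimal Python sketch of the reporting step follows. It assumes a
       plain key=value file named usage inside a per-job directory. The
       function spool_usage, the file name, the format, and the field
       selection are all hypothetical and do not describe the spool files
       sge_shepherd really writes. The sketch only shows collecting struct
       rusage data and publishing it atomically so that another process can
       read it safely.

```python
import os
import resource
import tempfile

# Fields copied from struct rusage; the selection is illustrative.
USAGE_FIELDS = ("ru_utime", "ru_stime", "ru_maxrss", "ru_minflt", "ru_majflt")


def spool_usage(spool_dir, job_id, usage):
    """Write usage as key=value lines to <spool_dir>/<job_id>/usage.

    The directory layout and file name are hypothetical; they only mirror
    the idea of a job-specific subdirectory of the sge_execd(8) spool
    directory.
    """
    job_dir = os.path.join(spool_dir, str(job_id))
    os.makedirs(job_dir, exist_ok=True)
    path = os.path.join(job_dir, "usage")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as out:
        for field in USAGE_FIELDS:
            out.write(f"{field}={getattr(usage, field)}\n")
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp_path, path)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        # RUSAGE_CHILDREN covers all waited-for children of this process.
        p = spool_usage(spool, 42, resource.getrusage(resource.RUSAGE_CHILDREN))
        with open(p) as f:
            print(f.read(), end="")
```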

       sge_shepherd  also  places  an exit status file in the spool directory.
       This exit status can be viewed with qacct -j JobId (see  qacct(1));  it
       is not the exit status of sge_shepherd itself but of one of the methods
       executed by sge_shepherd. Its meaning depends on the method in which
       an error occurred, if any. The possible methods are: prolog,
       parallel start, job, parallel stop, epilog, suspend, restart,
       terminate, clean, migrate, and checkpoint.

       The following exit values are returned:

       0      All methods: Operation was executed successfully.

       99     Job script, prolog and epilog: When FORBID_RESCHEDULE is not set
              in the qmaster_params of the cluster configuration (see
              sge_conf(5)), the job is re-queued. Otherwise see "Other".

       100    Job script, prolog and epilog: When FORBID_APPERROR is not set
              in the qmaster_params of the cluster configuration (see
              sge_conf(5)), the job is re-queued. Otherwise see "Other".

       Other  Job script: This is the exit status of the job itself. No action
              is taken on it, because Sun Grid Engine cannot know what the
              job's exit status means.
              Prolog,  epilog  and  parallel  start: The queue is set to error
              state and the job is re-queued.
              Parallel stop: The queue is set to error state, but the job is
              not re-queued. It is assumed that the job itself ran
              successfully and only the parallel environment shutdown
              procedure failed.
              Suspend, restart, terminate, clean, and migrate: Always
              success.
              Checkpoint: Success, except for kernel checkpointing, where
              it means that the checkpoint was not successful and did not
              happen (Sun Grid Engine will still migrate the job).
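       The exit-value rules above can be condensed into a lookup function.
       This Python sketch is an illustration, not Sun Grid Engine code. The
       function action_for, its keyword flags (mirroring FORBID_RESCHEDULE,
       FORBID_APPERROR, and whether kernel checkpointing is in use), and the
       returned action strings are hypothetical. It reads the suspend,
       restart, terminate, clean, and migrate entry as "always success".

```python
# Methods for which exit values 99 and 100 have a special meaning.
RESCHEDULABLE = {"job", "prolog", "epilog"}
# Methods whose nonzero exit values are treated as success (interpretation).
ALWAYS_SUCCESS = {"suspend", "restart", "terminate", "clean", "migrate"}


def action_for(method, status, *, forbid_reschedule=False,
               forbid_apperror=False, kernel_checkpointing=False):
    """Map a method name and its exit status to the documented consequence."""
    if status == 0:
        return "success"
    if method in RESCHEDULABLE:
        if status == 99 and not forbid_reschedule:
            return "requeue"
        if status == 100 and not forbid_apperror:
            return "requeue"
    # Everything below corresponds to the "Other" entry.
    if method == "job":
        return "no action"
    if method in ("prolog", "epilog", "parallel start"):
        return "queue error, requeue"
    if method == "parallel stop":
        return "queue error"
    if method in ALWAYS_SUCCESS:
        return "success"
    if method == "checkpoint":
        return "checkpoint failed, migrate" if kernel_checkpointing else "success"
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    print(action_for("job", 99, forbid_reschedule=True))
```

       For example, action_for("job", 99, forbid_reschedule=True) returns
       "no action", because with FORBID_RESCHEDULE set the value falls under
       "Other", where the job's own exit status is not interpreted.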


       sge_shepherd  should not be invoked manually, but only by sge_execd(8).


       sgepasswd contains a list of user names and their corresponding
       encrypted passwords. If available, the password file is used by
       sge_shepherd. To change the contents of this file, use the
       sgepasswd command; changing the file manually is not advised.
       <execd_spool>/job_dir/<job_id>     job specific directory


       sge_intro(1), sge_conf(5), sge_execd(8).


       See sge_intro(1) for a full statement of rights and permissions.