globus_gram_job_manager_interface_tutorial - GRAM Job Manager Scheduler

NAME

       globus_gram_job_manager_interface_tutorial - GRAM Job Manager Scheduler
       Tutorial

       This tutorial describes the steps needed to build a GRAM Job Manager
       Scheduler interface package.

       The audience for this tutorial is a person interested in adding support
       for a new scheduler interface to GRAM. This tutorial assumes some
       familiarity with GPT, autoconf, automake, and Perl. As a reference
       point, this tutorial refers to the code in the LSF Job Manager
       package.

Writing a Scheduler Interface

       This section deals with writing the Perl module which implements the
       interface between the GRAM job manager and the local scheduler. Consult
       the Job Manager Scheduler Interface section of this manual for a more
       detailed reference on the Perl modules which are used here.

       The scheduler interface is implemented as a Perl module which is a
       subclass of the Globus::GRAM::JobManager module. Its name must match
       the scheduler type string used when the service is installed. For the
       LSF scheduler, the name is lsf, so the module name is
       Globus::GRAM::JobManager::lsf and it is stored in the file lsf.pm.
       Though there are several methods in the JobManager interface, the only
       ones which absolutely need to be implemented in a scheduler module are
       submit, poll, and cancel.

       We’ll begin by looking at the start of the lsf source module, lsf.in
       (the transformation to lsf.pm happens when the setup script is run). To
       begin the script, we import the GRAM support modules into the scheduler
       module’s namespace, declare the module’s namespace, and declare this
       module as a subclass of the Globus::GRAM::JobManager module. All
       scheduler packages will need to do this, substituting the name of the
       scheduler type being implemented where we see lsf below.

       use Globus::GRAM::Error;
       use Globus::GRAM::JobState;
       use Globus::GRAM::JobManager;
       use Globus::Core::Paths;

       package Globus::GRAM::JobManager::lsf;

       @ISA = qw(Globus::GRAM::JobManager);

       Next, we declare any system-specific values which will be substituted
       when the setup package scripts are run. In the LSF case, we need to
       know the paths to a few programs which interact with the scheduler:

       my ($mpirun, $bsub, $bjobs, $bkill);

       BEGIN
       {
           $mpirun = '@MPIRUN@';
           $bsub   = '@BSUB@';
           $bjobs  = '@BJOBS@';
           $bkill  = '@BKILL@';
       }

       The values surrounded by at-signs (such as @MPIRUN@) will be replaced
       with the paths to the named programs by the find-lsf-tools script
       described below.
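
       To make the substitution step concrete, here is a minimal sketch of
       what happens to the @NAME@ placeholders. It is not the code that
       autoconf generates; the fill_placeholders helper and the example paths
       are hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each @NAME@ token in $template with the
# matching value from %$values. Unknown tokens are left untouched.
sub fill_placeholders
{
    my ($template, $values) = @_;

    $template =~ s{\@([A-Z_][A-Z0-9_]*)\@}
                  {exists $values->{$1} ? $values->{$1} : "\@$1\@"}ge;

    return $template;
}

# Example paths are made up for illustration.
my %found = (BSUB => '/usr/local/lsf/bin/bsub', MPIRUN => 'no');

print fill_placeholders("\$bsub = '\@BSUB\@';\n", \%found);
print fill_placeholders("\$mpirun = '\@MPIRUN\@';\n", \%found);
```

       Note that a program which was not found ends up as the string no,
       which is why the submit method later compares $mpirun against that
       value.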

   Writing a constructor
       Scheduler interfaces which need to set up some data before their other
       methods are called can override the new method, which acts as a
       constructor. Scheduler scripts which don’t need any per-instance
       initialization do not need to provide a constructor; the
       Globus::GRAM::JobManager constructor will do the job.

       If you do need to override this method, be sure to call the
       JobManager module’s constructor to allow it to do its initialization,
       as in this example:

       sub new
       {
           my $proto = shift;
           my $class = ref($proto) || $proto;
           my $self = $class->SUPER::new(@_);

           ## Insert scheduler-specific startup code here

           return $self;
       }
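
       The same pattern can be tried outside of GRAM. The following sketch
       assumes a stand-in base class, Demo::JobManager, in place of
       Globus::GRAM::JobManager, and a hypothetical script_dir field; neither
       is part of the real interface.

```perl
use strict;
use warnings;

# Stand-in for Globus::GRAM::JobManager so the example runs on its own.
package Demo::JobManager;

sub new
{
    my $proto = shift;
    my $class = ref($proto) || $proto;
    my $self = { JobDescription => shift };

    return bless $self, $class;
}

# Hypothetical scheduler module with per-instance setup.
package Demo::JobManager::demo;

our @ISA = ('Demo::JobManager');

sub new
{
    my $proto = shift;
    my $class = ref($proto) || $proto;
    my $self = $class->SUPER::new(@_);

    # Scheduler-specific startup: remember where job scripts go.
    $self->{script_dir} = '/tmp';

    return $self;
}

package main;

my $manager = Demo::JobManager::demo->new({ executable => '/bin/echo' });
print ref($manager), " writes scripts to $manager->{script_dir}\n";
```

       Because the base constructor blesses the object into the class it was
       called with, the subclass only has to add its own fields to the object
       that SUPER::new returns.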

       The job interface methods are called with only one argument, the
       scheduler object itself. That object contains a
       Globus::GRAM::JobDescription object ($self->{JobDescription}) which
       includes the values from the RSL string associated with the request, as
       well as a few extra values:

       job_id
           The string returned as the value of JOB_ID in the return hash from
           submit. This won’t be present for methods called before the job is
           submitted.

       uniq_id
           A string associated with this job request by the job manager
           program. It will be unique for all jobs on a host for all time.

       cache_tag
           The GASS cache tag related to this job submission. Files in the
           cache with this tag will be cleaned by the cleanup_cache() method.

       Now, let’s look at the methods which will interface to the scheduler.

   Submitting Jobs
       All scheduler modules must implement the submit method. This method is
       called when the job manager wishes to submit the job to the scheduler.
       The information in the original job request RSL string is available to
       the scheduler interface through the JobDescription data member of its
       hash.

       For most schedulers, this is the longest method to be implemented, as
       it must decide what to do with the values in the job description and
       convert them to something which the scheduler can understand.

       We’ll look at some of the steps in the LSF manager code to see how the
       scheduler interface is implemented.

       In the beginning of the submit method, we’ll get our parameters and
       look up the job description in the manager-specific object:

       sub submit
       {
           my $self = shift;
           my $description = $self->{JobDescription};

       Then we will check for values of the job parameters that we will be
       handling. For example, this is how we check for a valid job type in the
       LSF scheduler interface:

       if(defined($description->jobtype()))
       {
           if($description->jobtype() !~ /^(mpi|single|multiple)$/)
           {
               return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
           }
           elsif($description->jobtype() eq 'mpi' && $mpirun eq 'no')
           {
               return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
           }
       }
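
       The same check can be exercised on its own. This sketch assumes a
       hypothetical check_jobtype helper that returns a reason string in
       place of a GRAM error object, so that it runs without the Globus
       modules.

```perl
use strict;
use warnings;

# Hypothetical helper: returns undef when the job type is acceptable,
# otherwise a short reason string. $mpirun is 'no' when the setup script
# did not find mpirun.
sub check_jobtype
{
    my ($jobtype, $mpirun) = @_;

    return undef unless defined $jobtype;          # nothing to validate
    return 'unsupported job type'
        if $jobtype !~ /^(mpi|single|multiple)$/;
    return 'mpi requested but mpirun is missing'
        if $jobtype eq 'mpi' && $mpirun eq 'no';

    return undef;
}

for my $case (['single', 'no'], ['mpi', 'no'], ['condor', '/usr/bin/mpirun'])
{
    my $problem = check_jobtype(@$case);
    printf "%-8s %s\n", $case->[0], defined $problem ? $problem : 'ok';
}
```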

       The lsf module supports most of the core RSL attributes, so it does
       more processing to determine what to do with the values in the job
       description.

       Once we’ve inspected the JobDescription we’ll know what we need to tell
       the scheduler about so that it’ll start the job properly. For LSF, we
       will construct a job description script and pass that to the bsub
       command. This script is a bourne shell script with some special
       comments which LSF uses to decide what constraints to use when
       scheduling the job.

       First, we’ll open the new file, and write the file header:

           $lsf_job_script = new IO::File($lsf_job_script_name, '>');

           $lsf_job_script->print<<EOF;
       #! /bin/sh
       #
       # LSF batch job script built by Globus Job Manager
       #
       EOF

       Then, we’ll add some special comments to pass job constraints to LSF:

       if(defined($queue))
       {
           $lsf_job_script->print("#BSUB -q $queue\n");
       }
       if(defined($description->project()))
       {
           $lsf_job_script->print("#BSUB -P " . $description->project() . "\n");
       }

       At the end of the job description script, we actually run the
       executable named in the JobDescription. For LSF, we support a few
       different job types which require different startup commands.

       Before we start the executable, we quote and escape the strings in the
       argument list so that the application receives them exactly as they
       appeared in the job submission RSL string. Because the job description
       script uses Bourne shell syntax, we wrap each argument in double quotes
       and escape the backslash (\), dollar sign ($), double quote ("), and
       backquote (`) characters, which are the characters that remain special
       inside double quotes. We will use this new string later in the script.

           @arguments = $description->arguments();

           # Every argument must be a plain string, not a nested RSL value
           foreach(@arguments)
           {
               if(ref($_))
               {
                   return Globus::GRAM::Error::RSL_ARGUMENTS;
               }
           }
           if($arguments[0])
           {
               foreach(@arguments)
               {
                    $_ =~ s/\\/\\\\/g;    # backslash
                    $_ =~ s/\$/\\\$/g;    # dollar sign
                    $_ =~ s/"/\\\"/g;     # double quote
                    $_ =~ s/`/\\\`/g;     # backquote

                    $args .= '"' . $_ . '" ';
               }
           }
           else
           {
               $args = '';
           }
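
       The quoting rules can be checked in isolation. This is a minimal
       sketch, assuming a hypothetical quote_for_sh helper that folds the
       four substitutions into a single one; it is not the code used in the
       LSF module.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap one argument in double quotes for /bin/sh,
# escaping the four characters that stay special inside double quotes.
sub quote_for_sh
{
    my ($arg) = @_;

    $arg =~ s/([\\\$"`])/\\$1/g;

    return '"' . $arg . '"';
}

print quote_for_sh('cost is $5'), "\n";
print quote_for_sh('say "hi"'), "\n";
```

       Escaping all four characters in one pass avoids the ordering problem
       of separate substitutions, where the backslash must be handled first
       so that the backslashes added for the other characters are not
       escaped again.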

       To end the LSF job description script, we will write the command line
       of the executable to the script. Depending on the job type of this
       submission, we will need to start either one or more instances of the
       executable, or the mpirun program which will start the job with the
       executable count in the JobDescription:

       if($description->jobtype() eq 'mpi')
       {
           $lsf_job_script->print("$mpirun -np " . $description->count() . " ");

           $lsf_job_script->print($description->executable()
                                  . " $args \n");
       }
       elsif($description->jobtype() eq 'multiple')
       {
           for(my $i = 0; $i < $description->count(); $i++)
           {
               $lsf_job_script->print($description->executable() . " $args &\n");
           }
           $lsf_job_script->print("wait\n");
       }
       else
       {
           $lsf_job_script->print($description->executable() . " $args\n");
       }
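
       To see which lines each job type produces without writing a script
       file, the logic can be sketched as a function that returns the lines
       as a list. The launch_lines helper and its argument order are
       assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the shell lines that start the executable
# for the given job type. $args is the already-quoted argument string.
sub launch_lines
{
    my ($jobtype, $count, $executable, $args, $mpirun) = @_;

    if ($jobtype eq 'mpi')
    {
        return ("$mpirun -np $count $executable $args\n");
    }
    elsif ($jobtype eq 'multiple')
    {
        # One background process per requested instance, then wait.
        return (("$executable $args &\n") x $count, "wait\n");
    }

    return ("$executable $args\n");
}

print launch_lines('multiple', 2, '/bin/hostname', '', '/usr/bin/mpirun');
```

       The trailing wait line matters for the multiple job type: without it
       the job script would exit as soon as the background processes were
       started.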

       Next, we submit the job to the scheduler. Be sure to close the script
       file before trying to redirect it into the submit command, or some of
       the script file may be buffered and things will fail in strange ways!

       When the submission command returns, we check its output for the
       scheduler-specific job identifier. We will use this value to be able to
       poll or cancel the job.

       The return value of the script should be either a GRAM error object or
       a reference to a hash of values. The Globus::GRAM::JobManager
       documentation lists the valid keys to that hash. For the submit method,
       we’ll return the job identifier as the value of JOB_ID in the hash. If
       the scheduler returned a job status result, we could return that as
       well. LSF does not, so we’ll just check for the job ID and return it,
       or if the job fails, we’ll return an error object:

           $lsf_job_script->close();

           $job_id = (grep(/is submitted/,
                         split(/\n/, `$bsub < $lsf_job_script_name`)))[0];
           if($? == 0)
           {
               $job_id =~ m/<([^>]*)>/;
               $job_id = $1;

               return { JOB_ID => $job_id };
           }

           return Globus::GRAM::Error::INVALID_SCRIPT_REPLY;
       }
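
       The job ID extraction can be tested against captured text without
       running bsub. This sketch assumes a hypothetical parse_bsub_output
       helper, and the sample line is only modelled on typical bsub output.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the job ID out of captured bsub output.
# Returns undef when no "is submitted" line with an ID is present.
sub parse_bsub_output
{
    my ($output) = @_;

    my ($line) = grep { /is submitted/ } split(/\n/, $output);
    return undef unless defined $line;

    return $line =~ m/<([^>]*)>/ ? $1 : undef;
}

# Sample text modelled on typical bsub output; treat it as illustrative.
my $sample = "Job <1234> is submitted to queue <normal>.\n";
print parse_bsub_output($sample), "\n";
```

       Testing the result of the match before reading $1 guards against
       returning a stale capture from an earlier, unrelated match.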

       That finishes the submit method. Most of the functionality for the
       scheduler interface is now written. We just have a few more (much
       shorter) methods to implement.

   Polling Jobs
       All scheduler modules must also implement the poll method. The purpose
       of this method is to check for updates of a job’s status, for example,
       to see if a job has finished.

       When this method is called, we’ll get the job ID (which we returned
       from the submit method above) as well as the original job request
       information in the object’s JobDescription. In the LSF script, we’ll
       pass the job ID to the bjobs program, and that will return the job’s
       status information. We’ll compare the status field from the bjobs
       output to see what job state we should return.

       If the job fails, and there is a way to determine that from the
       scheduler, then the script should return in its hash both

       JOB_STATE => Globus::GRAM::JobState::FAILED

        and

       ERROR => Globus::GRAM::Error::<ERROR_TYPE>->value

       Here’s an excerpt from the LSF scheduler module implementation:

       sub poll
       {
           my $self = shift;
           my $description = $self->{JobDescription};
           my $job_id = $description->jobid();
           my $state;
           my $status_line;

           $self->log("polling job $job_id");

           # Get first line matching job id
           $_ = (grep(/$job_id/, `$bjobs $job_id 2>/dev/null`))[0];

           # Get 3rd field (status)
           $_ = (split(/\s+/))[2];

           if(/PEND/)
           {
               $state = Globus::GRAM::JobState::PENDING;
           }
           elsif(/USUSP|SSUSP|PSUSP/)
           {
               $state = Globus::GRAM::JobState::SUSPENDED;
           }
           ...
           return {JOB_STATE => $state};
       }
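
       The field extraction in the poll method can be tried on captured
       output. This sketch assumes a hypothetical bjobs_status helper; the
       sample lines are modelled on bjobs output and their values are made
       up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the status column (third field) of the
# first bjobs output line that mentions $job_id, or undef if none does.
sub bjobs_status
{
    my ($job_id, @lines) = @_;

    my ($line) = grep { /\b\Q$job_id\E\b/ } @lines;
    return undef unless defined $line;

    return (split(' ', $line))[2];
}

# Sample lines modelled on bjobs output; the values are made up.
my @output = (
    "JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME\n",
    "1234    alice   PEND  normal     login1                  sleep 60\n",
);

print bjobs_status(1234, @output), "\n";
```

       Splitting on a single space string skips leading whitespace, so the
       status stays in the third position even if the line is indented, and
       the word boundaries keep job 123 from matching the line for job 1234.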

   Cancelling Jobs
       All scheduler modules must also implement the cancel method. The
       purpose of this method is to cancel a running job.

       As with the poll method described above, this method will be given the
       job ID as part of the JobDescription object held by the manager object.
       If the scheduler interface provides feedback that the job was cancelled
       successfully, then we can return a JOB_STATE change to the FAILED
       state. Otherwise we can return an empty hash reference, and let the
       poll method return the state change next time it is called.

       To process a cancel in the LSF case, we will run the bkill command with
       the job ID.

       sub cancel
       {
           my $self = shift;
           my $description = $self->{JobDescription};
           my $job_id = $description->jobid();

           $self->log("cancel job $job_id");

           system("$bkill $job_id >/dev/null 2>/dev/null");

           if($? == 0)
           {
               return { JOB_STATE => Globus::GRAM::JobState::FAILED };
           }
           return Globus::GRAM::Error::JOB_CANCEL_FAILED;

       }

   End of the script
       It is required that all Perl modules return a true value when they
       are loaded. To do this, make sure the last line of your module consists
       of:

       1;

Setting up a Scheduler

       Once we’ve written the job manager script, we need to get it installed
       so that the gatekeeper will be able to run our new service. We do this
       by writing a setup script. For LSF, we will write the script
       setup-globus-job-manager-lsf.pl, which we will list in the LSF package
       as the Post_Install_Program.

       To set up the Gatekeeper service, our LSF setup script does the
       following:

       1.  Perform system-specific configuration.

       2.  Install the GRAM scheduler Perl module and register as a gatekeeper
           service.

       3.  (Optional) Install an RSL validation file defining extra scheduler-
           specific RSL attributes which the scheduler interface will support.

       4.  Update the GPT metadata to indicate that the job manager service
           has been set up.

   System-Specific Configuration
       First, our scheduler setup script probes for any system-specific
       information needed to interface with the local scheduler. For example,
       the LSF scheduler uses the mpirun, bsub, bqueues, bjobs, and bkill
       commands to submit, poll, and cancel jobs. We’ll assume that the
       administrator who is installing the package has these commands in their
       path. We’ll use an autoconf script to locate the executable paths for
       these commands and substitute them into our scheduler Perl module. In
       the LSF package, we have the find-lsf-tools script, which is generated
       during bootstrap by autoconf from the find-lsf-tools.in file:

       ## Required Prolog

       AC_REVISION($Revision: 1.5 $)
       AC_INIT(lsf.in)

       # checking for the GLOBUS_LOCATION

       if test "x$GLOBUS_LOCATION" = "x"; then
           echo "ERROR Please specify GLOBUS_LOCATION" >&2
           exit 1
       fi

       ## Check for optional tools, warn if not found

       AC_PATH_PROG(MPIRUN, mpirun, no)
       if test "$MPIRUN" = "no" ; then
           AC_MSG_WARN([Cannot locate mpirun])
       fi

       ## Check for required tools, error if not found

       AC_PATH_PROG(BSUB, bsub, no)
       if test "$BSUB" = "no" ; then
           AC_MSG_ERROR([Cannot locate bsub])
       fi

       ## Required epilog - update scheduler specific module

       prefix='$(GLOBUS_LOCATION)'
       exec_prefix='$(GLOBUS_LOCATION)'
       libexecdir=${prefix}/libexec

       AC_OUTPUT(
           lsf.pm:lsf.in
       )

       If this script exits with a non-zero error code, then the setup script
       propagates the error to the caller and exits without installing the
       service.

   Registering as a Gatekeeper Service
       Next, the setup script installs its Perl module into the Perl library
       directory and registers an entry in the Globus Gatekeeper’s service
       directory. The program globus-job-manager-service (distributed in the
       job manager program setup package) performs both of these tasks. When
       run, it expects the scheduler Perl module to be located in the
       $GLOBUS_LOCATION/setup/globus directory.

       $libexecdir/globus-job-manager-service -add -m lsf -s jobmanager-lsf;

   Installing an RSL Validation File
       If the scheduler script implements RSL attributes which are not part of
       the core set supported by the job manager, it must publish them in the
       job manager’s data directory. If the scheduler script wants to set some
       default values of RSL attributes, it may also set those as the default
       values in the validation file.

       The format of the validation file is described in the RSL Validation
       File Format section of the documentation. The validation file must be
       named scheduler-type.rvf and installed in the
       $GLOBUS_LOCATION/share/globus_gram_job_manager directory.

       In the LSF setup script, we check the list of queues supported by the
       local LSF installation, and add a section of acceptable values for the
       queue RSL attribute:

       open(VALIDATION_FILE,
            ">$ENV{GLOBUS_LOCATION}/share/globus_gram_job_manager/lsf.rvf");

       # Customize validation file with queue info
       open(BQUEUES, "bqueues -w |");

       # discard header
       $_ = <BQUEUES>;
       my @queues = ();

       while(<BQUEUES>)
       {
           chomp;

           # The queue name is the first whitespace-delimited field
           $_ =~ m/^(\S+)/;

           push(@queues, $1);
       }
       close(BQUEUES);

       if(@queues)
       {
           print VALIDATION_FILE "Attribute: queue\n";
           print VALIDATION_FILE join(" ", "Values:", @queues);

       }
       close VALIDATION_FILE;
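
       The queue handling can be tried on captured output without an LSF
       installation. This sketch assumes a hypothetical queue_rvf_entry
       helper that returns the validation file text as a string; the sample
       lines are modelled on bqueues output and their values are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: turn captured "bqueues -w" output into the text of
# a validation file entry for the queue attribute. Returns '' when no
# queues are listed.
sub queue_rvf_entry
{
    my @lines = @_;

    shift @lines;                       # discard the header line
    my @queues = map { /^(\S+)/ ? $1 : () } @lines;

    return '' unless @queues;
    return "Attribute: queue\n" . join(' ', 'Values:', @queues) . "\n";
}

# Sample lines modelled on bqueues output; the values are made up.
print queue_rvf_entry(
    "QUEUE_NAME  PRIO STATUS       MAX JL/U\n",
    "normal       30  Open:Active    -    -\n",
    "short        35  Open:Active    -    -\n",
);
```

       Lines that do not start with a queue name are skipped here, so a blank
       line in the output cannot add a stale or empty value to the list.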

   Updating GPT Metadata
       Finally, the setup package should create and finalize a
       Grid::GPT::Setup object. The value of the package_name argument must be
       the same as the Name attribute of the gpt_package_metadata element in
       the package’s metadata file. If either the new() or finish() method
       fails, then it is considered good practice to clean up any files
       created by the setup script. From setup-globus-job-manager-lsf.pl:

       my $metadata =
           new Grid::GPT::Setup(
               package_name => 'globus_gram_job_manager_setup_lsf');

       $metadata->finish();

Packaging

       Now that we’ve written a job manager scheduler interface, we’ll package
       it using GPT to make it easy for our users to build and install. We’ll
       start by gathering the different files we’ve written above into a
       single directory lsf.

       · lsf.in

       · find-lsf-tools.in

       · setup-globus-job-manager-lsf.pl

   Package Documentation
       If there are any scheduler-specific options defined for this scheduler
       module, or if there are any optional setup items, then it is good to
       provide a documentation page which describes these. For LSF, we
       describe the changes since the last version of this package in the file
       globus_gram_job_manager_lsf.dox. This file consists of a doxygen
       mainpage. See www.doxygen.org for information on how to write
       documentation with that tool.

   configure.in
       Now, we’ll write our configure.in script. This file is converted to the
       configure shell script by the bootstrap script below. Since we don’t do
       any probes for compile-time tools or system characteristics, we just
       call the various initialization macros used by GPT, declare that we may
       provide doxygen documentation, and then output the files we need
       substitutions done on.

       AC_REVISION($Revision: 1.5 $)
       AC_INIT(Makefile.am)

       GLOBUS_INIT
       AM_PROG_LIBTOOL

       dnl Initialize the automake rules the last argument
       AM_INIT_AUTOMAKE($GPT_NAME, $GPT_VERSION)

       LAC_DOXYGEN("../", "*.dox")

       GLOBUS_FINALIZE

       AC_OUTPUT(
               Makefile
               pkgdata/Makefile
               pkgdata/pkg_data_src.gpt
               doxygen/Doxyfile
               doxygen/Doxyfile-internal
               doxygen/Makefile
       )

   Package Metadata
       Now we’ll write our metadata file, and put it in the pkgdata
       subdirectory of our package. The important things to note in this file
       are the package name and version, the post_install_program, and the
       setup sections. These define how the package distribution will be
       named, what command will be run by gpt-postinstall when this package is
       installed, and which setup dependencies will be written when the
       Grid::GPT::Setup object is finalized.

       <?xml version="1.0" encoding="UTF-8"?>
       <!DOCTYPE gpt_package_metadata SYSTEM "package.dtd">

       <gpt_package_metadata Format_Version="0.02" Name="globus_gram_job_manager_setup_lsf" >

         <Aging_Version Age="0" Major="1" Minor="0" />
         <Description >LSF Job Manager Setup</Description>
         <Functional_Group >ResourceManagement</Functional_Group>
         <Version_Stability Release="Beta" />
         <src_pkg >

           <With_Flavors build="no" />
           <Source_Setup_Dependency PkgType="pgm" >
             <Setup_Dependency Name="globus_gram_job_manager_setup" >
               <Version >
                 <Simple_Version Major="3" />
               </Version>
             </Setup_Dependency>
             <Setup_Dependency Name="globus_common_setup" >
               <Version >
                 <Simple_Version Major="2" />
               </Version>
             </Setup_Dependency>
           </Source_Setup_Dependency>

           <Build_Environment >
             <cflags >@GPT_CFLAGS@</cflags>
             <external_includes >@GPT_EXTERNAL_INCLUDES@</external_includes>
             <pkg_libs > </pkg_libs>
             <external_libs >@GPT_EXTERNAL_LIBS@</external_libs>
           </Build_Environment>

           <Post_Install_Message >
             Run the setup-globus-job-manager-lsf setup script to configure an
             lsf job manager.
           </Post_Install_Message>

           <Post_Install_Program >
             setup-globus-job-manager-lsf
           </Post_Install_Program>

           <Setup Name="globus_gram_job_manager_service_setup" >
             <Aging_Version Age="0" Major="1" Minor="0" />
           </Setup>

         </src_pkg>

       </gpt_package_metadata>

   Automake Makefile.am
       The automake Makefile.am for this package is short because there isn’t
       any compilation needed for this package. We just need to define what
       needs to be installed into which directory, and what source files need
       to be put into our source distribution. For the LSF package, we need to
       list the lsf.in, find-lsf-tools, and setup-globus-job-manager-lsf.pl
       scripts as files to be installed into the setup directory. We need to
       add those files plus our documentation source file to the EXTRA_DIST
       variable so that they will be included in source distributions. The
       rest of the lines in the file are needed for proper interaction with
       GPT.

       include $(top_srcdir)/globus_automake_pre
       include $(top_srcdir)/globus_automake_pre_top

       SUBDIRS = pkgdata doxygen

       setup_SCRIPTS = \
           lsf.in \
           find-lsf-tools \
           setup-globus-job-manager-lsf.pl

       EXTRA_DIST = $(setup_SCRIPTS) globus_gram_job_manager_lsf.dox

       include $(top_srcdir)/globus_automake_post
       include $(top_srcdir)/globus_automake_post_top

   Bootstrap
       The final piece we need to write for our package is the bootstrap
       script. This script is the standard bootstrap script for a Globus
       package, with an extra line to generate the find-lsf-tools script using
       autoconf.

       #!/bin/sh

       # checking for the GLOBUS_LOCATION

       if test "x$GLOBUS_LOCATION" = "x"; then
           echo "ERROR Please specify GLOBUS_LOCATION" >&2
           exit 1
       fi

       if [ ! -f ${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh ]; then
           echo "ERROR: Unable to locate \${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh"
           echo "       Please ensure that you have installed the globus-core package and"
           echo "       that GLOBUS_LOCATION is set to the proper directory"
           exit 1
       fi
       fi

       autoconf find-lsf-tools.in > find-lsf-tools
       chmod 755 find-lsf-tools

       exit 0

Building, Testing, and Debugging

       With this all done, we can now try to build our new package. To do so,
       we’ll need to run

       % ./bootstrap
       % ./globus-build

       If all of the files are written correctly, this should result in our
       package being installed into $GLOBUS_LOCATION. Once that is done, we
       should be able to run gpt-postinstall to configure our new job manager.

       Now, we should be able to run the command

       % globus-personal-gatekeeper -start -jmtype lsf

       to start a gatekeeper configured to run a job manager using our new
       scripts. Running this will output a contact string (referred to as
       <contact-string> below), which we can use to connect to this new
       service. To do so, we’ll run globus-job-run to submit a test job:

       % globus-job-run <contact-string> /bin/echo Hello, LSF
       Hello, LSF

   When Things Go Wrong
       If the test above fails, or more complicated job failures are
       occurring, then you’ll have to debug your scheduler interface. Here
       are a few tips to help you out.

       Make sure that your script is valid Perl. If you run

       perl -I$GLOBUS_LOCATION/lib/perl $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/lsf.pm

       you should get no output. If there are any diagnostics, correct them
       (in the lsf.in file), reinstall your package, and rerun the setup
       script.

       Look at the Globus Toolkit Error FAQ and see if the failure is perhaps
       not related to your scheduler script at all.

       Enable logging for the job manager. By default, the job manager is
       configured to log only when it notices a job failure. However, if your
       problem is that your script is not returning a failure code when you
       expect, you might want to enable logging always. To do this, modify the
       job manager configuration file to contain "-save-logfile always" in
       place of "-save-logfile on_error".

       Add logging messages to your script. The JobManager object implements
       a log method, which allows you to write messages to the job manager
       log file. Do this as your methods are called to pinpoint where the
       error occurs.

       Save the job description file when your script is run. This will allow
       you to run the globus-job-manager-script.pl interactively (or in the
       Perl debugger). To save the job description file, you can do

       $self->{JobDescription}->save("/tmp/job_description.$$");

       in any of the methods you’ve implemented.

Version 10.42 globus_gram_job_manager_interface_tutorial(3)