Name
condor_glidein - add a remote grid resource to a local Condor pool
Synopsis
condor_glidein [ -help ]
condor_glidein [ -admin address ] [ -anybody ] [ -archdir dir ] [
-basedir basedir ] [ -count CPU count ] [ <Execute Task Options> ] [
<Generate File Options> ] [ -gsi_daemon_name cert_name ] [ -idletime
minutes ] [ -install_gsi_trusted_ca_dir path ] [ -install_gsi_gridmap
file ] [ -localdir dir ] [ -memory MBytes ] [ -project name ] [ -queue
name ] [ -runtime minutes ] [ -runonly ] [ <Set Up Task Options> ] [
-suffix suffix ] [ -slots slot count ] <contact argument>
Description
condor_glidein allows the temporary addition of a grid resource to a
local Condor pool. The addition is accomplished by installing and
executing some of the Condor daemons on the remote grid resource, such
that it reports in as part of the local Condor pool. condor_glidein
accomplishes two separate tasks: set up and execution. These separated
tasks allow flexibility, in that the user may use condor_glidein to do
only one of the tasks or both, in addition to customizing the tasks.
The set up task generates a script that may be used to start the Condor
daemons during the execution task, places this script on the remote
grid resource, composes and installs a configuration file, and installs
the condor_master, condor_startd, and condor_starter daemons on the
grid resource.
The execution task runs the script generated by the set up task. The
goal of the script is to invoke the condor_master daemon. The Condor
job glidein_startup appears in the queue of the local Condor pool for
each invocation of condor_glidein. To remove the grid resource from
the local Condor pool, use condor_rm to remove the glidein_startup job.
The Condor jobs that perform the set up and execute tasks use Condor-G
and Globus protocols (gt2 or gt4) to communicate with the remote
resource. Therefore, an X.509 certificate (proxy) is required for the
user running condor_glidein.
Specify the remote grid machine with the command line argument
<contact argument>. <contact argument> takes one of four forms:
1. hostname
2. Globus contact string
3. hostname/jobmanager-<schedulername>
4. -contactfile filename
The argument -contactfile filename specifies the full path and file
name of a file that contains Globus contact strings. Each of the
resources given by a Globus contact string is added to the local
Condor pool.
The set up task of condor_glidein copies the binaries for the correct
platform from a central server. To obtain access to the server, or to
set up your own server, follow instructions on the Glidein Server Setup
page, at http://www.cs.wisc.edu/condor/glidein. Set up need only be
done once per site, as the installation is never removed.
By default, all files installed on the remote grid resource are placed
in the directory $(HOME)/Condor_glidein. $(HOME) is evaluated and
defined on the remote machine using a grid map. This directory must be
in a shared file system accessible by all machines that will run the
Condor daemons. By default, the daemons’ log files will also be written
in this directory. Change this directory with the -localdir option to
make Condor daemons write to local scratch space on the execution
machine. For debugging initial problems, it may be convenient to have
the log files in the more accessible default directory. If using the
default directory, occasionally clean up old log and execute
directories to avoid running out of space.
Examples
To have 10 grid resources running PBS at a grid site with a gatekeeper
named gatekeeper.site.edu join the local Condor pool:
% condor_glidein -count 10 gatekeeper.site.edu/jobmanager-pbs
If you try something like the above and condor_glidein is not able to
automatically determine everything it needs to know about the remote
site, it will ask you to provide more information. A typical result of
this process is something like the following command:
% condor_glidein \
-count 10 \
-arch 6.6.7-i686-pc-Linux-2.4 \
-setup_jobmanager jobmanager-fork \
gatekeeper.site.edu/jobmanager-pbs
The Condor jobs that do the set up and execute tasks will appear in the
queue for the local Condor pool. As a result of a successful glidein,
use condor_status to see that the remote grid resources are part of the
local Condor pool.
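The longer command shown above can also be assembled from its parts. The following is a minimal sketch, not part of condor_glidein itself: the helper name glidein_command is hypothetical, and it only builds the argument list from options documented in this manual page without running anything.

```python
import shlex


def glidein_command(contact, count=1, arch=None, setup_jobmanager=None):
    """Build a condor_glidein argument list (hypothetical helper).

    Only options documented in this manual page are used; the contact
    argument always comes last, as in the synopsis.
    """
    argv = ["condor_glidein", "-count", str(count)]
    if arch is not None:
        argv += ["-arch", arch]
    if setup_jobmanager is not None:
        argv += ["-setup_jobmanager", setup_jobmanager]
    argv.append(contact)
    return argv


# Values taken from the example in this manual page.
argv = glidein_command(
    "gatekeeper.site.edu/jobmanager-pbs",
    count=10,
    arch="6.6.7-i686-pc-Linux-2.4",
    setup_jobmanager="jobmanager-fork",
)
print(shlex.join(argv))
```

Keeping the command as a list avoids quoting mistakes if it is later passed to a process launcher; the printed form is only for inspection.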
A list of common problems and solutions is presented in this manual
page.
Generate File Options
-genconfig
Create a local copy of the configuration file that may be used on
the remote resource. The file is named
glidein_condor_config.<suffix>. The string defined by
<suffix> defaults to the process id (PID) of the condor_glidein
process or is defined with the -suffix command line option. The
configuration file may be edited for later use with the -useconfig
option.
-genstartup
Create a local copy of the script used on the remote resource to
invoke the condor_master. The file is named
glidein_startup.<suffix>. The string defined by <suffix> defaults to
the process id (PID) of the condor_glidein process or is defined
with the -suffix command line option. The file may be edited for
later use with the -usestartup option.
-gensubmit
Generate submit description files, but do not submit. The submit
description file for the set up task is named
glidein_setup.submit.<suffix>. The submit description file for the
execute task is named glidein_run.submit.<suffix>. The string
defined by <suffix> defaults to the process id (PID) of the
condor_glidein process or is defined with the -suffix command line
option.
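The naming scheme shared by the three options above can be summarized in a short sketch. The function name generated_file_names is hypothetical, and the PID of the Python process stands in for the PID of the condor_glidein process, which is an assumption made only for illustration.

```python
import os


def generated_file_names(suffix=None):
    """Map each generate-file option to the file names it produces.

    If no suffix is given, the current process id is used as a stand-in
    for the condor_glidein process id (an assumption for illustration).
    """
    if suffix is None:
        suffix = str(os.getpid())
    return {
        "-genconfig": [f"glidein_condor_config.{suffix}"],
        "-genstartup": [f"glidein_startup.{suffix}"],
        "-gensubmit": [
            f"glidein_setup.submit.{suffix}",  # set up task
            f"glidein_run.submit.{suffix}",    # execute task
        ],
    }


for option, names in generated_file_names("test1").items():
    print(option, names)
```

Passing an explicit suffix, as -suffix does, makes the file names predictable, which helps when the files are edited and reused with -useconfig or -usestartup.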
Set Up Task Options
-setuponly
Do only the set up task of condor_glidein. This option cannot be
used together with -runonly.
-setup_here
Do the set up task on the local machine, instead of at a remote grid
resource. This may be used, for example, to do the set up task of
condor_glidein in an AFS area that is read-only from the remote grid
resource.
-forcesetup
During the set up task, force the copying of files, even if this
overwrites existing files. Use this to push out changes to the
configuration.
-useconfig config_file
The set up task copies the specified configuration file, rather than
generating one.
-usestartup startup_file
The set up task copies the specified startup script, rather than
generating one.
-setup_jobmanager jobmanagername
Identifies the jobmanager on the remote grid resource to receive the
files during the set up task. If a reasonable default can be
discovered through MDS, this is optional. jobmanagername is a
string representing any gt2 name for the job manager. The correct
string in most cases will be jobmanager-fork. Other common strings
may be jobmanager, jobmanager-condor, jobmanager-pbs, and
jobmanager-lsf.
Execute Task Options
-runonly
Starts execution of the Condor daemons on the grid resource. If any
of the necessary files or executables are missing, condor_glidein
exits with an error code. This option cannot be used together
with -setuponly.
-run_here
Runs condor_master directly rather than submitting a Condor job that
causes the remote execution. To instead generate a script that does
this, use -run_here in combination with -gensubmit. This may be
useful for running Condor daemons on resources that are not directly
accessible by Condor.
Options
-help
Display brief usage information and exit.
-basedir basedir
Specifies the base directory on the remote grid resource used for
placing files. The default directory is $(HOME)/Condor_glidein on the
grid resource.
-archdir dir
Specifies the directory on the remote grid resource for placement of
the Condor executables. The default value for -archdir is based
upon version information on the grid resource. It is of the form
<basedir>/<condor-version>-<Globus canonicalsystemname>. An example
of the directory (without the base directory) for Condor version
6.1.13 running on a Sun Sparc machine with Solaris 2.6 is
6.1.13-sparc-sun-solaris-2.6.
-localdir dir
Specifies the directory on the remote grid resource in which to
create log and execution subdirectories needed by Condor. If limited
disk quota in the home or base directory on the grid resource is a
problem, set -localdir to a large temporary space, such as /tmp or
/scratch. If the batch system requires invocation of Condor daemons
in a temporary scratch directory, ’.’ may be used for the definition
of the -localdir option.
-arch architecture
Identifies the platform of the required tarball containing the
correct Condor daemon executables to download and install. If a
reasonable default can be discovered through MDS, this is optional.
A list of possible values may be found at
http://www.cs.wisc.edu/condor/glidein/binaries. The architecture
name is the same as the tarball name without the suffix tar.gz. An
example is 6.6.5-i686-pc-Linux-2.4.
-queue name
The argument name is a string used at the grid resource to identify
a job queue.
-project name
The argument name is a string used at the grid resource to identify
a project name.
-memory MBytes
The maximum memory size in Megabytes to request from the grid
resource.
-count CPU count
The number of CPUs requested to join the local pool. The default is
1.
-slots slot count
For machines with multiple CPUs, the CPUs may be divided up into
slots. slot count is the number of slots that results. By default,
Condor divides multiple-CPU resources such that each CPU is a slot,
each with an equal share of RAM, disk, and swap space. This option
configures the number of slots, so that multi-threaded jobs can run
in a slot with multiple CPUs. For example, if 4 CPUs are requested
and -slots is not specified, Condor will divide the request up into
4 slots with 1 CPU each. However, if -slots 2 is specified, Condor
will divide the request up into 2 slots with 2 CPUs each, and if
-slots 1 is specified, Condor will put all 4 CPUs into one slot.
-idletime minutes
The amount of time that a remote grid resource will remain in the
idle state before the daemons shut down. A value of 0 (zero) means that
the daemons never shut down due to remaining in the idle state. In
this case, the -runtime option defines when the daemons shut down.
The default value is 20 minutes.
-runtime minutes
The maximum amount of time the Condor daemons on the remote grid
resource will run before shutting themselves down. This option is
useful for resources with enforced maximum run times. Setting
-runtime to be a few minutes shorter than the enforced limit gives
the daemons time to perform a graceful shut down.
-anybody
Sets the Condor START expression for the added remote grid resource
to True. This permits any user’s job which can run on the added
remote grid resource to run. Without this option, only jobs owned by
the user executing condor_glidein can execute on the remote grid
resource. WARNING: Using this option may violate the usage policies
of many institutions.
-admin address
The e-mail address to which problem reports are sent. The default is
the login of the user running condor_glidein at the UID domain of the
local Condor pool.
-suffix suffix
The suffix to use when naming generated files. The default is the
process id (PID) of the condor_glidein process.
-gsi_daemon_name cert_name
Includes and enables GSI authentication in the configuration for the
remote grid resource. The argument is the GSI certificate name that
the daemons will use to authenticate themselves.
-install_gsi_trusted_ca_dir path
The argument identifies the directory containing the trusted CA
certificates that the daemons are to use (for example,
/etc/grid-security/certificates). The contents of this directory will
be installed at the remote site in the directory
<basedir>/grid-security.
-install_gsi_gridmap file
The argument is the file name of the GSI-specific X.509 map file
that the daemons will use. The file will be installed at the remote
site in <basedir>/grid-security. The file contains entries mapping
certificates to user names. At the very least, it must contain an
entry for the certificate given by the command-line option
-gsi_daemon_name. If other Condor daemons use different
certificates, then this file will also list any certificates that
the daemons will encounter for the condor_schedd, condor_collector,
and condor_negotiator. See section for more information.
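Three of the options above involve small computations: the default -archdir path, the division of CPUs under -slots, and a -runtime value chosen below a site's enforced limit. The following sketch illustrates them under stated assumptions. All function names are hypothetical, the /home/user path is made up, the 5-minute margin is an arbitrary choice for "a few minutes", and uneven CPU divisions are rejected only because this manual page does not describe them.

```python
def default_archdir(basedir, condor_version, canonical_system_name):
    """Form <basedir>/<condor-version>-<Globus canonicalsystemname>."""
    return f"{basedir.rstrip('/')}/{condor_version}-{canonical_system_name}"


def cpus_per_slot(cpu_count, slot_count=None):
    """Return the number of CPUs in each slot.

    Without -slots, each CPU becomes its own slot.
    """
    if slot_count is None:
        slot_count = cpu_count
    if slot_count < 1 or cpu_count % slot_count != 0:
        # Assumption: only even divisions are documented, so reject others.
        raise ValueError("slot_count must be positive and divide cpu_count evenly")
    return [cpu_count // slot_count] * slot_count


def runtime_minutes(enforced_limit_minutes, margin_minutes=5):
    """Pick a -runtime value a few minutes shorter than the enforced limit.

    The default margin of 5 minutes is an assumption, not a documented value.
    """
    if not 0 < margin_minutes < enforced_limit_minutes:
        raise ValueError("margin must be positive and shorter than the limit")
    return enforced_limit_minutes - margin_minutes


# Hypothetical base directory; version and platform from the -archdir text.
print(default_archdir("/home/user/Condor_glidein", "6.1.13", "sparc-sun-solaris-2.6"))
# The three -slots cases described above for a request of 4 CPUs.
print(cpus_per_slot(4), cpus_per_slot(4, 2), cpus_per_slot(4, 1))
# A hypothetical site with a 120-minute enforced run time.
print(runtime_minutes(120))
```

The slot lists correspond to the cases in the -slots description: four slots of 1 CPU, two slots of 2 CPUs, and one slot of 4 CPUs.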
Exit Status
condor_glidein will exit with a status value of 0 (zero) upon complete
success, or with non-zero values upon failure. The status value will be
1 (one) if condor_glidein encountered an error making a directory, was
unable to copy a tar file, encountered an error in parsing the command
line, or was not able to gather required information. The status value
will be 2 (two) if there was an error in the remote set up. The status
value will be 3 (three) if there was an error in remote submission. The
status value will be -1 (negative one) if no resource was specified in
the command line.
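A wrapper script might translate these status values into messages. The sketch below is one way to do so. The names EXIT_STATUS_MEANINGS and describe_exit_status are hypothetical. It also assumes a POSIX system, where an exit value of -1 is reported to the parent process as 255.

```python
# Meanings taken from the Exit Status text of this manual page.
EXIT_STATUS_MEANINGS = {
    0: "complete success",
    1: ("error making a directory, copying a tar file, parsing the "
        "command line, or gathering required information"),
    2: "error in the remote set up",
    3: "error in remote submission",
    -1: "no resource was specified in the command line",
}


def describe_exit_status(status):
    """Return a short description for a condor_glidein exit status."""
    if status == 255:
        # Assumption: on POSIX, an exit value of -1 is seen as 255.
        status = -1
    return EXIT_STATUS_MEANINGS.get(
        status, "non-zero status not described in this manual page"
    )


for code in (0, 2, 255):
    print(code, describe_exit_status(code))
```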
Common problems are listed below. Many of these are best discovered by
looking in the StartLog log file on the remote grid resource.
WARNING: The file xxx is not writable by condor
This error occurs when condor_glidein is run in a directory that
does not have the proper permissions for Condor to access files. An
AFS directory does not give Condor the user’s AFS ACLs.
Glideins fail to run due to GLIBC errors
Check the list of available glidein binaries
(http://www.cs.wisc.edu/condor/glidein/binaries), and try specifying
the architecture name that includes the correct glibc version for
the remote grid site.
Glideins join pool but no jobs run on them
One common cause of this problem is that the remote grid resources
are in a different file system domain, and the submitted Condor jobs
have an implicit requirement that they must run in the same file
system domain. See section for details on using Condor’s file
transfer capabilities to solve this problem. Another cause of this
problem is a communication failure. For example, a firewall may be
preventing the condor_negotiator or the condor_schedd daemons from
connecting to the condor_startd on the remote grid resource.
Although work is being done to remove this requirement in the
future, it is currently necessary to have full bidirectional
connectivity, at least over a restricted range of ports. See page
for more information on configuring a port range.
Glideins run but fail to join the pool
This may be caused by the local pool’s security settings or by a
communication failure. Check that the security settings in the local
pool’s configuration file allow write access to the remote grid
resource. To not modify the security settings for the pool, run a
separate pool specifically for the remote grid resources, and use
flocking to balance jobs across the two pools of resources. If the
log files indicate a communication failure, then see the next item.
The startd cannot connect to the collector
This may be caused by several things. One is a firewall. Another is
when the compute nodes do not have even outgoing network access.
Configuration to work without full network access to and from the
compute nodes is still in the experimental stages, so for now, the
short answer is that you must at least have a range of open
(bidirectional) ports and set up the configuration file as described
on page . (Use the option -genconfig, edit the generated
configuration file, and then do the glidein execute task with the
option -useconfig.)
Another possible cause of connectivity problems may be the use of
UDP by the condor_startd to register itself with the
condor_collector. Force it to use TCP as described on page .
Yet another possible cause of connectivity problems is when the
remote grid resources have more than one network interface, and the
default one chosen by Condor is not the correct one. One way to fix
this is to modify the glidein startup script using the -genstartup
and -usestartup options. The script needs to determine the IP
address associated with the correct network interface, and assign
this to the environment variable _condor_NETWORK_INTERFACE.
NFS file locking problems
If the -localdir option uses files on NFS (not recommended, but
sometimes convenient for testing), the Condor daemons may have
trouble manipulating file locks. Try inserting the following into
the configuration file:
IGNORE_NFS_LOCK_ERRORS = True
Author
Condor Team, University of Wisconsin-Madison
Copyright
Copyright (C) 1990-2009 Condor Team, Computer Sciences Department,
University of Wisconsin-Madison, Madison, WI. All Rights Reserved.
Licensed under the Apache License, Version 2.0.
See the Condor Version 7.2.4 Manual or
http://www.condorproject.org/license for additional notices.
condor-admin@cs.wisc.edu