NAME
strigger - Used to set, get or clear Slurm trigger information.
SYNOPSIS
strigger --set [OPTIONS...]
strigger --get [OPTIONS...]
strigger --clear [OPTIONS...]
DESCRIPTION
strigger is used to set, get or clear Slurm trigger information.
Triggers include events such as a node failing, a job reaching its time
limit or a job terminating. These events can cause actions such as the
execution of an arbitrary script. Typical uses include notifying
system administrators of node failures and gracefully terminating a job
when its time limit is approaching. A hostlist expression for the
nodelist or job ID is passed as an argument to the program.
Trigger events are not processed instantly, but a check is performed
for trigger events on a periodic basis (currently every 15 seconds).
Any trigger events which occur within that interval will be compared
against the trigger programs set at the end of the time interval. The
trigger program will be executed once for any event occurring in that
interval. The record of those events (e.g. nodes which went DOWN in
the previous 15 seconds) will then be cleared. The trigger program
must set a new trigger before the end of the next interval to ensure
that no trigger events are missed. If desired, multiple trigger
programs can be set for the same event.
IMPORTANT NOTE: This command can only set triggers if run by the user
SlurmUser unless SlurmUser is configured as user root. This is
required for the slurmctld daemon to set the appropriate user and group
IDs for the executed program. Also note that the program is executed
on the same node that the slurmctld daemon uses rather than some
allocated compute node. To check the value of SlurmUser, run the
command:
scontrol show config | grep SlurmUser
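The check described above can be scripted. The following is a minimal sketch, not part of strigger itself: the function names are hypothetical, and it assumes that scontrol reports the setting on a line of the form "SlurmUser = slurm(64030)".

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.

# Read `scontrol show config` output on stdin and print the SlurmUser name.
# Assumes a line of the form "SlurmUser = slurm(64030)".
slurm_user_from_config() {
    awk -F'[ =()]+' '$1 == "SlurmUser" { print $2; exit }'
}

# may_set_triggers SLURM_USER CURRENT_USER
# Succeeds if CURRENT_USER may set triggers under the rule stated above:
# either SlurmUser is root, or the caller is SlurmUser.
may_set_triggers() {
    [ "$1" = "root" ] || [ "$1" = "$2" ]
}
```

A possible invocation is: may_set_triggers "$(scontrol show config | slurm_user_from_config)" "$(id -un)"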
ARGUMENTS
--block_err
Trigger an event when a BlueGene block enters an ERROR state.
--clear
Clear or delete a previously defined event trigger. The --id,
--jobid or --user option must be specified to identify the
trigger(s) to be cleared.
-d, --down
Trigger an event if the specified node goes into a DOWN state.
-D, --drained
Trigger an event if the specified node goes into a DRAINED
state.
-F, --fail
Trigger an event if the specified node goes into a FAILING
state.
-f, --fini
Trigger an event when the specified job completes execution.
--get Show registered event triggers. Options can be used for
filtering purposes.
-i, --id=id
Trigger ID number.
-I, --idle
Trigger an event if the specified node remains in an IDLE state
for at least the time period specified by the --offset option.
This can be useful to hibernate a node that remains idle, thus
reducing power consumption.
-j, --jobid=id
Job ID of interest. NOTE: The --jobid option can not be used in
conjunction with the --node option. When the --jobid option is
used in conjunction with the --up or --down option, all nodes
allocated to that job will be considered the nodes used as a
trigger event.
-n, --node[=host]
Host name(s) of interest. By default, all nodes associated with
the job (if --jobid is specified) or on the system are
considered for event triggers. NOTE: The --node option can not
be used in conjunction with the --jobid option. When the --jobid
option is used in conjunction with the --up, --down or --drained
option, all nodes allocated to that job will be considered the
nodes used as a trigger event.
-o, --offset=seconds
The specified action should follow the event by this time
interval. Specify a negative value if the action should precede
the event. The default value is zero if no --offset option is
specified. The resolution of this time is about 20 seconds, so
to execute a script not less than five minutes prior to a job
reaching its time limit, specify --offset=-320 (5 minutes plus 20
seconds).
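The offset arithmetic can be sketched as a small helper. The function name is hypothetical. It pads the requested lead time with the approximate 20 second resolution and prints a negative value, since a negative offset places the action before the event.

```shell
#!/bin/sh
# Sketch only: offset_before is a hypothetical helper, not a Slurm command.

# offset_before MINUTES
# Print an --offset value that runs the action at least MINUTES before
# the event, padded by the approximate 20 second resolution.
offset_before() {
    minutes=$1
    resolution=20   # seconds, approximate value taken from the text above
    echo "-$(( minutes * 60 + resolution ))"
}
```

For example, --offset="$(offset_before 5)" expands to --offset=-320.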
-p, --program=path
Execute the program at the specified fully qualified pathname
when the event occurs. The program will be executed as the user
who sets the trigger. If the program fails to terminate within
5 minutes, it will be killed along with any spawned processes.
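Because the pathname must be fully qualified, it may help to validate it before registering a trigger. The following is a sketch with a hypothetical function name. It only inspects the local file system, whereas the program actually runs on the node used by the slurmctld daemon, so a passing check is a hint rather than a guarantee.

```shell
#!/bin/sh
# Sketch only: is_valid_trigger_program is a hypothetical helper.

# is_valid_trigger_program PATH
# Succeed only if PATH is absolute and names an executable regular file
# on the local machine.
is_valid_trigger_program() {
    case $1 in
        /*) ;;            # absolute pathname, keep checking
        *)  return 1 ;;   # relative pathname, reject
    esac
    [ -f "$1" ] && [ -x "$1" ]
}
```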
-Q, --quiet
Do not report non-fatal errors. This can be useful to clear
triggers which may have already been purged.
-r, --reconfig
Trigger an event when the system configuration changes.
--set Register an event trigger based upon the supplied options.
NOTE: An event is only triggered once. A new event trigger must
be established for future events of the same type to be
processed.
-t, --time
Trigger an event when the specified job’s time limit is reached.
This must be used in conjunction with the --jobid option.
-u, --up
Trigger an event if the specified node is returned to service
from a DOWN state.
--user=user_name_or_id
Clear or get triggers associated with the specified user.
Specify either a user name or user ID.
-v, --verbose
Print detailed event logging. This includes time-stamps on data
structures, record counts, etc.
-V, --version
Print version information and exit.
OUTPUT FIELD DESCRIPTIONS
TRIG_ID
Trigger ID number.
RES_TYPE
Resource type: job or node
RES_ID Resource ID: job ID or host names or "*" for any host
TYPE Trigger type: time or fini (for jobs only), down or up (for jobs
or nodes), or drained, idle or reconfig (for nodes only)
OFFSET Time offset in seconds. Negative numbers indicate the action
should occur before the event (if possible)
USER Name of the user requesting the action
PROGRAM
Pathname of the program to execute when the event occurs
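These fields can be filtered with standard tools. The following is a sketch with a hypothetical function name. It assumes the output of strigger --get is a header line followed by whitespace-separated columns in the order listed above, with no embedded blanks in any field.

```shell
#!/bin/sh
# Sketch only: trigger_ids_of_type is a hypothetical helper.

# trigger_ids_of_type TYPE
# Read `strigger --get` output on stdin and print the TRIG_ID of every
# row whose TYPE column (the fourth field) equals TYPE.
trigger_ids_of_type() {
    awk -v want="$1" '$1 != "TRIG_ID" && $4 == want { print $1 }'
}
```

A possible use is to feed each printed ID to strigger --clear --id=ID, for example with the output of: strigger --get --jobid=1235 | trigger_ids_of_type down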
EXAMPLES
Execute the program "/usr/sbin/slurm_admin_notify" whenever any node in
the cluster goes down. The subject line will include the node names
which have entered the down state (passed as an argument to the script
by SLURM).
> cat /usr/sbin/slurm_admin_notify
#!/bin/bash
# Submit trigger for next event
strigger --set --node --down \
--program=/usr/sbin/slurm_admin_notify
# Notify administrator by e-mail
/bin/mail -s "NodesDown:$*" slurm_admin@site.com
> strigger --set --node --down \
--program=/usr/sbin/slurm_admin_notify
Execute the program "/usr/sbin/slurm_suspend_node" whenever any node in
the cluster remains in the idle state for at least 600 seconds.
> strigger --set --node --idle --offset=600 \
--program=/usr/sbin/slurm_suspend_node
Execute the program "/home/joe/clean_up" when job 1234 is within 10
minutes of reaching its time limit.
> strigger --set --jobid=1234 --time --offset=-600 \
--program=/home/joe/clean_up
Execute the program "/home/joe/node_died" when any node allocated to
job 1234 enters the DOWN state.
> strigger --set --jobid=1234 --down \
--program=/home/joe/node_died
Show all triggers associated with job 1235.
> strigger --get --jobid=1235
TRIG_ID RES_TYPE RES_ID TYPE OFFSET USER PROGRAM
123 job 1235 time -600 joe /home/bob/clean_up
125 job 1235 down 0 joe /home/bob/node_died
Delete event trigger 125.
> strigger --clear --id=125
Execute /home/joe/job_fini upon completion of job 1237.
> strigger --set --jobid=1237 --fini --program=/home/joe/job_fini
COPYING
Copyright (C) 2007 The Regents of the University of California.
Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
CODE-OCEC-09-009. All rights reserved.
This file is part of SLURM, a resource management program. For
details, see <https://computing.llnl.gov/linux/slurm/>.
SLURM is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
SLURM is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
SEE ALSO
scontrol(1), sinfo(1), squeue(1)