mon - monitor services for availability, sending alarms upon failures.

NAME

       mon  - monitor services for availability, sending alarms upon failures.

SYNOPSIS

       mon [-dfhlMSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c config] [-D
       dir] [-i secs] [-k num] [-l [statetype]] [-L dir] [-m num] [-p num] [-P
       pidfile] [-r delay] [-s dir]

DESCRIPTION

       mon is a general-purpose scheduler for monitoring service  availability
       and  triggering alerts upon detecting failures.  mon was designed to be
       open in the sense that it supports arbitrary monitoring facilities  and
       alert  methods  via  a  common  interface, which are easily implemented
       through programs (in C, Perl, shell, etc.), SNMP traps, and special Mon
       (UDP packet) traps.

OPTIONS

       -a dir Path        to       alert       scripts.       Default       is
              /usr/local/lib/mon/alert.d:alert.d.  Multiple alert paths may be
              specified  by  separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon  by
              default).

       -b dir Base  directory  for  mon. scriptdir, alertdir, and statedir are
              all relative to this directory unless specified from /.  Default
              is /usr/lib/mon.

       -B dir Configuration  file base directory. All config files are located
              here, including mon.cf, monusers.cf, and auth.cf.

       -A authfile
              Authentication  configuration   file.   By   default   this   is
              /etc/mon/auth.cf   if   the   /etc/mon   directory   exists,  or
              /usr/lib/mon/auth.cf otherwise.

       -c file
              Read configuration from file.  This defaults to
              /etc/mon/mon.cf if the /etc/mon directory exists, otherwise
              to /etc/mon.cf.

       -d     Enable debugging mode.

       -D dir Path  to   state   directory.    Default   is   the   first   of
              /var/state/mon,  /var/lib/mon,  and  /usr/lib/mon/state.d  which
              exists.

       -f     Fork and run as a daemon process. This is the preferred  way  to
              run mon.

       -h     Print help information.

       -i secs
              Sleep  interval,  in seconds. Defaults to 1. This shouldn’t need
              to be adjusted for any reason.

       -k num Set log history to a maximum of num entries. Defaults to 100.

       -l statetype
              Load state from the last saved state file. The  supported  saved
              state  types  are  disabled  for disabled watches, services, and
              hosts, opstatus for failure/alert/ack status  of  all  services,
              and  all  for  both.   If  no statetype is provided, disabled is
              assumed.

       -L dir Sets the log dir. See also logdir  in  the  configuration  file.
              The  default is /var/log/mon if that directory exists, otherwise
              log.d in the base directory.

       -M     Pre-process the configuration  file  with  the  macro  expansion
              package m4.

       -m num Set the throttle for the maximum number of processes to num.

       -p num Make server listen on port num.  This defaults to 2583.

       -S     Start with the scheduler stopped.

       -P pidfile
              Store  the  server’s pid in pidfile, the default is the first of
              /var/run/mon/mon.pid, /var/run/mon.pid, and  /etc/mon.pid  whose
              directory  exists.   An  empty  value tells mon not to use a pid
              file.

       -r delay
              Sets the number of seconds used to randomize the  startup  delay
              before  each service is scheduled. Refer to the global randstart
              variable in the configuration file.

       -s dir Path      to       monitor       scripts.       Default       is
              /usr/local/lib/mon/mon.d:mon.d.  Multiple monitor paths may be
              specified by separating them with a colon.  Non-absolute paths
              are taken to be relative to the base directory (/usr/lib/mon by
              default).

       -v     Print version information.

DEFINITIONS

       monitor
              A program which tests for a certain  condition,  returns  either
              true  or false, and optionally produces output to be passed back
              to the scheduler.  Common monitors detect host reachability  via
              ICMP echo messages, or connection to TCP services.

       period A period in time as interpreted by the Time::Period module.

       alert  A  program  which sends a message when invoked by the scheduler.
              The scheduler calls upon an alert when it detects a failure from
              a  monitor.   An  alert  program  accepts  a set of command-line
              arguments from the scheduler, in addition to data  via  standard
              input.

       hostgroup
              A  single  host  or  list  of  hosts,  specified  as names or IP
              addresses.

       service
              A collection of  parameters  used  to  deal  with  monitoring  a
              particular  resource  which is provided by a group. Services are
              usually modeled after things such as an SMTP server,  ICMP  echo
              capability, server disk space availability, or SNMP events.

       view   A collection of hostgroups, used to filter mon output for client
              display.  For example, a ’network-services’ view might be
              defined so your network staff can see just the hostgroups which
              matter to them, without having to see all hostgroups defined in
              Mon.

       watch  A collection of services which apply to a particular group.

OPERATION

       When the mon  scheduler  starts,  it  reads  a  configuration  file  to
       determine  the  services  it  needs  to monitor. The configuration file
       defaults to /etc/mon.cf, and can be specified using the  -c  parameter.
       If  the  -M  option  is  specified, then the configuration file is pre-
       processed with m4.  If the configuration file ends with .m4,  the  file
       is also processed by m4 automatically.

       The  scheduler  enters a loop which handles client connections, monitor
       invocations, and failure alerts. Each service has a timer, specified in
       the  configuration  file  as  the  interval  variable,  which tells the
       scheduler how frequently to invoke a monitor process.  The scheduler
       may be temporarily stopped.  While it is stopped, client access still
       functions, but nothing is scheduled.  This is useful when resetting
       the server, because you can save the hosts and services which are
       disabled, reset the server with the scheduler stopped, re-disable
       those hosts and services, then start the scheduler.  It also allows
       making atomic changes across several client connections.  See the
       moncmd man page for more information.

MONITOR PROGRAMS

       Monitor processes are invoked with the arguments specified in the
       configuration file, followed by the hosts from the applicable host
       group.  For example, if the watch group is "servers", which contains
       the hostnames "smtp", "nntp", and "ns", and the monitor line reads as
       follows,

       monitor fping.monitor -t 4000 -r 2

       then the executable "fping.monitor" will be executed with these
       parameters:

       MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns

       MONITOR_DIR    is    actually    a    search    path,    by     default
       /usr/local/lib/mon/mon.d   then   /usr/lib/mon/mon.d,  but  it  can  be
       overridden by the -s option or in the configuration file.  If all hosts
       in  the  hostgroup have been disabled, then a warning is sent to syslog
       and the monitor is not run. This behavior may be  overridden  with  the
       "allow_empty_group"  option  in  the  service definition.  If the final
       argument to the  "monitor"  line  is  ";;"  (it  must  be  preceded  by
       whitespace),  then  the host list will not be appended to the parameter
       list.
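
       The argument handling described above can be sketched in Python as
       follows.  This is an illustration only, not mon's own code: the
       function name build_monitor_argv is hypothetical, it takes the text
       after the "monitor" keyword, and both the whitespace splitting and
       the removal of the ";;" marker from the result are assumptions.

```python
def build_monitor_argv(monitor_line, hosts):
    """Return the argument vector for one monitor invocation.

    monitor_line is the text following the "monitor" keyword; hosts is the
    list of enabled hosts in the applicable hostgroup.
    """
    words = monitor_line.split()
    if words and words[-1] == ";;":
        # A trailing ";;" suppresses the host list (assumed to be dropped).
        return words[:-1]
    return words + list(hosts)


print(build_monitor_argv("fping.monitor -t 4000 -r 2", ["smtp", "nntp", "ns"]))
```

       Resolving the program name against the monitor search path is left
       out of this sketch.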

       In addition to environment variables defined by the user in the service
       definition, mon passes certain variables to each monitor process.

       MON_LAST_SUMMARY
              The  first  line  of  the  output from the last time the monitor
              exited.  This is not the summary of the current monitor run, but
              the  previous  one.   This  may  be  used  by an alert script to
              provide historical context in an alert.

       MON_LAST_OUTPUT
              The entire output of the monitor from the last time  it  exited.
              This  is  not  the  output  of  the current monitor run, but the
              previous one.  This may be used by an alert  script  to  provide
              historical context in an alert.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The description of this service, as defined in the configuration
              file using the description tag.

       MON_DEPEND_STATUS
              The depend status, "0" if dependency failure, "1" otherwise.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.

       MON_STATEDIR
              The  directory where state files should be kept, as indicated by
              the statedir global configuration variable.

       MON_CFBASEDIR
              The directory where  configuration  files  should  be  kept,  as
              indicated by the cfbasedir global configuration variable.

       "fping.monitor"  should  return  an  exit  status  of 0 if it completed
       successfully (found no problems), or nonzero if a problem was detected.
       The first line of output from the monitor script has a special meaning:
       it is used as a brief summary of the exact failure which was  detected,
       and is passed to the alert program. All remaining output is also passed
       to the alert program, but it has no required interpretation.

       If a monitor for a particular service is still running,  and  the  time
       comes  for  mon  to  run  another monitor for that service, it will not
       start another monitor. For example, if the interval  is  10s,  and  the
       monitor  does  not finish running within 10 seconds, then mon will wait
       until the first monitor exits before running another one.
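
       The exit status and output conventions above can be illustrated with
       a minimal monitor skeleton in Python.  It is a sketch under stated
       assumptions, not a monitor shipped with mon: the names tcp_check and
       run_monitor, the default port, and the choice of listing the failed
       hosts on the summary line are all hypothetical.

```python
import socket


def tcp_check(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_monitor(hosts, check=tcp_check):
    """Return (exit_status, output) following the monitor conventions."""
    failed = [h for h in hosts if not check(h)]
    if not failed:
        return 0, ""                      # exit 0: no problems found
    summary = " ".join(failed)            # first line: brief summary
    detail = [f"{h}: check failed" for h in failed]
    return 1, "\n".join([summary] + detail) + "\n"


# A real monitor would end with something like:
#   status, output = run_monitor(sys.argv[1:])
#   sys.stdout.write(output); sys.exit(status)
# Here a fake check keeps the demonstration off the network.
status, output = run_monitor(["smtp", "nntp"], check=lambda h: h != "nntp")
print(status, output.splitlines()[0])
```

       The hosts arrive as the trailing command-line arguments, so the
       commented-out lines show where sys.argv would be consumed.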

ALERT DECISION LOGIC

       Upon a non-zero or zero exit status, the associated alert or upalert
       program (respectively) is started, pending the following conditions:

       1. If an alert for a specific service is disabled, do not send an
          alert.

       2. If dep_behavior is set to a, or alertdepend is set, and a parent
          dependency is failing, then suppress the alert.

       3. If the alert has previously been acknowledged, do not send the
          alert, unless it is an upalert.

       4. If an alert is not within the specified period, record the failure
          via syslog(3) and do not send an alert.

       5. If the failure does not fall within a defined period, do not send
          an alert.

       6. No upalerts are sent without corresponding down alerts, unless
          no_comp_alerts is defined in the period section.  An upalert will
          only be sent if the previous state is a failure.

       7. If an alert was already sent within the last alertevery interval,
          do not send another alert, unless the summary output from the
          current monitor program differs from that of the last monitor
          process.

       8. Otherwise, send an alert using each alert program listed for that
          period.

       The observe_detail argument to alertevery affects this behavior by
       observing the changes in the detail part of the output in addition to
       the summary line.  If a monitor has successive failures and the
       summary output changes in each of them, alertevery will not suppress
       multiple consecutive alerts.  The reasoning is that if the summary
       output changes, then a significant event occurred and the user should
       be alerted.  The "strict" argument to alertevery will both suppress
       the comparison of the output from the previous monitor run to the
       current one and prevent a successful return value of the monitor from
       resetting the alertevery timer.  For example, "alertevery 24h strict"
       will only send out an alert once every 24 hours, regardless of
       whether the monitor output changes, or if the service stops and then
       starts failing.
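
       The alertevery rules can be modeled with a small Python class.  This
       is a simplified sketch of the behavior described here, not mon's
       implementation: the class name AlertEvery and its fields are
       hypothetical, observe_detail is not modeled, and the other
       conditions (disabled services, dependencies, acknowledgements,
       periods) are assumed to have been checked already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertEvery:
    interval: float                     # alertevery interval, in seconds
    strict: bool = False
    last_alert: Optional[float] = None  # time the last alert was sent
    last_summary: Optional[str] = None  # summary of the previous monitor run

    def on_success(self) -> None:
        # A passing monitor resets the timer unless "strict" is set.
        if not self.strict:
            self.last_alert = None
            self.last_summary = None

    def on_failure(self, now: float, summary: str) -> bool:
        """Return True if an alert should be sent for this failure."""
        within = (self.last_alert is not None
                  and now - self.last_alert < self.interval)
        changed = summary != self.last_summary
        # A changed summary overrides suppression, except in strict mode.
        send = (not within) or (changed and not self.strict)
        self.last_summary = summary
        if send:
            self.last_alert = now
        return send


ae = AlertEvery(interval=3600)
print([ae.on_failure(t, s) for t, s in [(0, "smtp"), (60, "smtp"), (120, "smtp nntp")]])
```

       In the demonstration the second failure is suppressed because it
       repeats the same summary within the interval, while the third is
       sent because the summary changed.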

ALERT PROGRAMS

       Alert programs are found in the path supplied with the -a parameter,
       or in the default alert path (/usr/local/lib/mon/alert.d, then alert.d
       under the base directory) if none is specified.  They are invoked with
       the following command-line parameters:

       -s service
              Service tag from the configuration file.

       -g group
              Host group name from the configuration file.

       -h hosts
              The expanded version of the host  group,  space  delimited,  but
              contained in one shell "word".

       -l alertevery
              The number of seconds until the next alarm will be sent.

       -O     This option is supplied to an alert only if the alert is being
              generated as a result of an expected trap timing out.

       -t time
              The time (in time(2) format) of when this failure condition  was
              detected.

       -T     This option is supplied to an alert only if the alert was
              triggered by a trap.

       -u     This option is supplied to an alert only if it is  being  called
              as an upalert.

       The  remaining  arguments  are supplied from the trailing parameters in
       the configuration file, after the "alert" service parameter.

       As with monitor programs, alert programs are invoked  with  environment
       variables defined by the user in the service definition, in addition to
       the following which are explicitly set by the server:

       MON_LAST_SUMMARY
              The first line of the output from  the  last  time  the  monitor
              exited.

       MON_LAST_OUTPUT
              The entire output of the monitor from the last time it exited.

       MON_LAST_FAILURE
              The time(2) of the last failure for this service.

       MON_FIRST_FAILURE
              The time(2) of the first time this service failed.

       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.

       MON_DESCRIPTION
              The description of this service, as defined in the configuration
              file using the description tag.

       MON_GROUP
              The watch group which triggered this alarm.

       MON_SERVICE
              The service heading which generated this alert.

       MON_RETVAL
              The exit value of the failed monitor program, or return value as
              accepted from a trap.

       MON_OPSTATUS
              The operational status of the service.

       MON_ALERTTYPE
              Has  one  of  the  following values: "failure", "up", "startup",
              "trap", or "traptimeout", and signifies the type of alert  which
              was triggered.

       MON_TRAP_INTENDED
              This is only set when an unknown mon trap is received and caught
              by the default/default watch/service.  This contains
              colon-separated entries of the trap’s intended watch group and
              service name.

       MON_LOGDIR
              The directory where log files should be placed, as indicated by
              the logdir global configuration variable.

       MON_STATEDIR
              The  directory where state files should be kept, as indicated by
              the statedir global configuration variable.

       MON_CFBASEDIR
              The directory where  configuration  files  should  be  kept,  as
              indicated by the cfbasedir global configuration variable.

       The  first  line from standard input must be used as a brief summary of
       the problem, normally supplied as the subject line of an email, or text
       sent  to  an alphanumeric pager. Interpretation of all subsequent lines
       read from  stdin  is  left  up  to  the  alerting  program.  The  usual
       parameters  are  a  list  of recipients to deliver the notification to.
       The interpretation of the recipients is not specified, and is up to the
       alert program.
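
       As an illustration of the calling convention above, the following
       Python sketch parses the options passed to an alert and splits
       standard input into the summary and the remaining detail.  The
       function names and the sample values are hypothetical, the trailing
       arguments are assumed to be recipients, and the options are assumed
       to precede them on the command line.

```python
import argparse
import io


def parse_alert_args(argv):
    """Parse the options mon passes to an alert, plus trailing arguments."""
    # add_help=False frees "-h", which carries the host list here.
    p = argparse.ArgumentParser(add_help=False)
    p.add_argument("-s", dest="service")
    p.add_argument("-g", dest="group")
    p.add_argument("-h", dest="hosts", default="")
    p.add_argument("-l", dest="alertevery")
    p.add_argument("-t", dest="time")
    p.add_argument("-O", dest="trap_timeout", action="store_true")
    p.add_argument("-T", dest="trap", action="store_true")
    p.add_argument("-u", dest="upalert", action="store_true")
    p.add_argument("recipients", nargs="*")
    args = p.parse_args(argv)
    args.hosts = args.hosts.split()   # one shell word, space delimited
    return args


def read_alert_input(stream):
    """Split the alert's standard input into (summary, detail)."""
    summary = stream.readline().rstrip("\n")
    return summary, stream.read()


args = parse_alert_args(["-s", "smtp", "-g", "servers", "-h", "mx1 mx2",
                         "-t", "1000000000", "-u", "ops@example.com"])
summary, detail = read_alert_input(io.StringIO("mx1\nmx1: timed out\n"))
print(args.service, args.hosts, args.upalert, args.recipients, summary)
```

       A real alert would pass sys.argv[1:] and sys.stdin instead of the
       sample values, and could read MON_ALERTTYPE and the other variables
       from os.environ.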

CONFIGURATION FILE

       The  configuration  file  consists  of  zero  or  more  global variable
       definitions, zero or more hostgroup definitions, and one or more  watch
       definitions.  Each  watch  definition  may  have  one  or  more service
       definitions. A watch definition is terminated by a blank line,  another
       definition,  or  the  end  of  the file. A line beginning with optional
       leading whitespace and a pound ("#") is regarded as a comment,  and  is
       ignored.

       Lines  are  parsed  as  they  are  read. Long lines may be continued by
       ending them with a backslash ("\").  If a line is continued,  then  the
       backslash, the trailing whitespace after the backslash, and the leading
       whitespace of the  following  line  are  removed.  The  end  result  is
       assembled into a single line.
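
       The continuation rule can be sketched in Python as shown below.  The
       function name join_continued_lines is hypothetical and the code is a
       reading of the rule as stated here, not an extract from the mon
       parser.

```python
import re


def join_continued_lines(text):
    """Join backslash-continued lines into single logical lines."""
    logical = []
    pending = None                     # text carried over from a continued line
    for raw in text.splitlines():
        # Leading whitespace of a continuation line is removed.
        line = raw if pending is None else pending + raw.lstrip()
        m = re.search(r"\\\s*$", line)
        if m:
            # Drop the backslash and any whitespace that follows it.
            pending = line[:m.start()]
        else:
            logical.append(line)
            pending = None
    if pending is not None:
        logical.append(pending)        # file ended on a continued line
    return logical


sample = "hostgroup servers smtp \\  \n    nntp ns\nwatch servers\n"
print(join_continued_lines(sample))
```

       Whitespace before the backslash is kept, so a separating space must
       be written there if one is wanted in the joined line.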

       Typically the configuration file has the following layout:

       1. Global variable definitions

       2. Hostgroup definitions

       3. Watch definitions

       See the "etc/example.cf" file which comes with the distribution for an
       example.

   Global Variables
       The following variables may be set to  override  compiled-in  defaults.
       Command-line   options   will  have  a  higher  precedence  than  these
       definitions.

       alertdir = dir
              dir is the full path to the alert scripts. This is the value set
              by the -a command-line parameter.

              Multiple  alert paths may be specified by separating them with a
              colon.  Non-absolute paths are taken to be relative to the  base
              directory (/usr/lib/mon by default).

              When  the configuration file is read, all alerts referenced from
              the configuration will be looked up in each of these paths,  and
              the full path to the first instance of the alert found is stored
              in a hash. This hash is only generated upon startup or  after  a
              "reset"  command,  so  newly  added  alert  scripts  will not be
              recognized until a "reset" is performed.
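
              The lookup described above can be sketched in Python.  The
              function name resolve_scripts and the dictionary it returns
              are hypothetical stand-ins for the hash mentioned in the
              text, and "mail.alert" is used only as a sample name.

```python
import os


def resolve_scripts(names, search_path, basedir="/usr/lib/mon"):
    """Map each script name to the first matching file on the search path."""
    # Non-absolute entries are taken relative to the base directory.
    dirs = [d if os.path.isabs(d) else os.path.join(basedir, d)
            for d in search_path.split(":") if d]
    found = {}
    for name in names:
        for d in dirs:
            candidate = os.path.join(d, name)
            if os.path.isfile(candidate):
                found[name] = candidate   # first instance wins
                break
    return found


print(resolve_scripts(["mail.alert"], "/usr/local/lib/mon/alert.d:alert.d"))
```

              Because the mapping is built once, a script added later is
              absent from it until the function is called again, which
              mirrors the need for a "reset".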

       mondir = dir
              dir is the full path to the monitor scripts. This value may also
              be  set  by the -s command-line parameter. If this path does not
              begin with a "/", it will be relative to basedir.

              Multiple monitor paths may be specified by separating them with
              a colon. All paths must be absolute.

              When  the  configuration  file  is read, all monitors referenced
              from the configuration will be looked up in each of these paths,
              and  the full path to the first instance of the monitor found is
              stored in a hash. This hash is only generated  upon  startup  or
              after a "reset" command, so newly added monitor scripts will not
              be recognized until a "reset" is performed.

       statedir = dir
              dir is the full path to the  state  directory.   mon  uses  this
              directory  to  save various state information. If this path does
              not begin with a "/", it will be relative to basedir.

       logdir = dir
              dir is the full path  to  the  log  directory.   mon  uses  this
              directory  to  save various logs, including the downtime log. If
              this path does not begin with a "/",  it  will  be  relative  to
              basedir.

       basedir = dir
              dir  is  the  full  path  for the state, log, monitor, and alert
              directories.

       cfbasedir = dir
              dir is the full path where all the config  files  can  be  found
              (monusers.cf, auth.cf, etc.).

       authfile = file
              file  is  the  path to the authentication file. If the path does
              not begin with a "/", it will be relative to cfbasedir.

       authtype = type [type...]
              type is the type of authentication to use.  A space-separated
              list of types may be specified, and they will be checked in the
              order they are listed.  As soon as a successful authentication
              is performed, the user is considered authenticated by mon for
              the duration of the session and no more authentication checks
              are performed.

              If  type  is  getpwnam,  then  the  standard  Unix  passwd  file
              authentication method will be used  (calls  getpwnam(3)  on  the
              user  and  compares  the crypt(3)ed version of the password with
              what it gets from  getpwnam).  This  will  not  work  if  shadow
              passwords are enabled on the system.

              If  type  is  userfile,  then usernames and hashed passwords are
              read  from  userfile,  which  is  defined   via   the   userfile
              configuration variable.

              If type is pam, then PAM (pluggable authentication modules) will
              be used  for  authentication.   The  service  specified  by  the
              pamservice  global  will be used. If no global is given, the PAM
              passwd service will be used.

              If type is trustlocal, then if the client connection comes from
              localhost, the username passed from the client will be trusted,
              and the password will be ignored.  This can be used when you
              want the client to handle the authentication for you, for
              example a CGI script using one of the many Apache
              authentication methods.

       userfile = file
              This file is used when authtype is set to userfile.  It consists
              of a sequence of lines of  the  format  ’username  :  password’.
              password  is  stored  as  the hash returned by the standard Unix
              crypt(3) function.  NOTE: the format of this file is  compatible
              with  the Apache file based username/password file format. It is
              possible to use the htpasswd program  supplied  with  Apache  to
              manage the mon userfile.

              Blank lines and lines beginning with # are ignored.
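
              A reader for this file format might look like the following
              Python sketch.  The function name parse_userfile is
              hypothetical, the hash strings are made-up placeholders, and
              tolerating optional whitespace around the colon is an
              assumption based on the format shown above.

```python
def parse_userfile(text):
    """Return {username: password_hash} from userfile-style text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # blank lines and comments
        name, sep, hashed = line.partition(":")
        if sep:
            users[name.strip()] = hashed.strip()
    return users


sample = "# mon users\n\nalice : abPlaceholderHash\nbob:xyPlaceholderHash\n"
print(parse_userfile(sample))
```

              Verifying a password against the stored hash is left out of
              the sketch.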

       pamservice = service
              The PAM service used for authentication. This is applicable only
              if "pam" is specified as a parameter to the authtype setting. If
              this global is not defined, it defaults to passwd.

       serverbind = addr

       trapbind = addr

              serverbind and trapbind specify which address to bind the server
              and trap ports to, respectively.  If these are not defined,  the
              default  address  is INADDR_ANY, which allows connections on all
              interfaces. For security reasons, it could be  a  good  idea  to
              bind only to the loopback interface.

       dtlogfile = file
              file is a file which will be used to record the downtime log.
              Whenever a service fails for some amount of time and then stops
              failing, this event is written to the log.  If this parameter is
              not set, no logging is done.  The format of the file is as
              follows (# is a comment and may be ignored):

              timenoticed group service firstfail downtime interval summary.

              timenoticed is the time(2) the service came back up.

              group service is the group and service which failed.

              firstfail is the time(2) when the service began to fail.

              downtime is the number of seconds the service failed.

              interval  is  the  frequency  (in  seconds)  that the service is
              polled.

              summary is the summary line from when the service was failing.
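
              The record layout can be read with a few lines of Python.
              This is a sketch only: the names DowntimeEntry and
              parse_dtlog_line are hypothetical, the sample record is
              invented, and the numeric fields are assumed to be whole
              seconds.

```python
from typing import NamedTuple, Optional


class DowntimeEntry(NamedTuple):
    timenoticed: int   # time(2) the service came back up
    group: str
    service: str
    firstfail: int     # time(2) the service began to fail
    downtime: int      # seconds the service failed
    interval: int      # polling frequency, in seconds
    summary: str


def parse_dtlog_line(line: str) -> Optional[DowntimeEntry]:
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    f = line.split(None, 6)            # the summary may contain spaces
    if len(f) < 6:
        raise ValueError(f"malformed downtime log line: {line!r}")
    summary = f[6] if len(f) > 6 else ""
    return DowntimeEntry(int(f[0]), f[1], f[2],
                         int(f[3]), int(f[4]), int(f[5]), summary)


print(parse_dtlog_line("1000003600 servers smtp 1000000000 3600 300 mx1 mx2"))
```

              Splitting at most six times keeps a multi-word summary
              together in the last field.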

       monerrfile = filename
              By default, when mon daemonizes itself, it connects  stdout  and
              stderr to /dev/null. If monerrfile is set to a file, then stdout
              and stderr will be appended to that file. In all cases stdin  is
              connected  to /dev/null. If mon is told to run in the foreground
              and  to  not  daemonize,  then  none  of  this  applies,   since
              stdin/stdout/stderr  stay connected to whatever they were at the
              time of invocation.

       dtlogging = yes/no

              Turns downtime logging on or off. The default is off.

       histlength = num
              num is the maximum number of events to be retained in the
              history list.  The default is 100.  This value may also be set
              by the -k command-line parameter.

       historicfile = file
              If this variable is set, then alerts are  logged  to  file,  and
              upon  startup,  some  (or  all) of the past history is read into
              memory.

       historictime = timeval
              timeval is the amount of the history file to read upon startup.
              "Now" - timeval is read.  See the explanation of interval in the
              "Service Definitions" section for a description of timeval.

       serverport = port
              port is the TCP port number that the server should bind to. This
              value may also be set by the -p command-line parameter. Normally
              this port is looked up via getservbyname(3), and it defaults  to
              2583.

       trapport = port
              port is the UDP port number that the trap server should bind to.
              Normally this port is looked up  via  getservbyname(3),  and  it
              defaults to 2583.

       pidfile = path
              path is the file the server will store its pid in.  This value
              may also be set by the -P command-line parameter.

       maxprocs = num
              Throttles the number of concurrently forked  processes  to  num.
              The intent is to provide a safety net for the unlikely situation
              when the server tries to take on too many tasks at  once.   Note
              that this situation has only been reported to happen when trying
              to use a garbled configuration file! You don’t  want  to  use  a
              garbled configuration file now, do you?

       cltimeout = secs
              Sets  the  client  inactivity timeout to secs.  This is meant to
              help thwart denial of service attacks or  recover  from  crashed
              clients.  secs is interpreted as a "1h/1m/1s" string, where "1m"
              = 60 seconds.
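
              The "1h/1m/1s" notation can be converted to seconds as in
              this Python sketch.  The function name parse_timeval is
              hypothetical, and only the three suffixes shown here are
              handled; anything else mon may accept is outside the sketch.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}   # seconds per unit


def parse_timeval(value):
    """Convert a string such as "30s", "1m" or "1h" to seconds."""
    m = re.fullmatch(r"(\d+)([smh])", value.strip())
    if not m:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]


print(parse_timeval("1m"), parse_timeval("1h"))
```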

       randstart = interval
              When the server starts, normally no service is scheduled until
              the interval defined in its service section has elapsed.  This
              can cause long delays before the first check of a
              service,  and  possibly  a  high  load on the server if multiple
              things are scheduled at the same intervals.  This option is used
              to  randomize  the scheduling of the first test for all services
              during the startup  period,  and  immediately  after  the  reset
              command.  If randstart is defined, the scheduled run time of all
              services of all watch groups will be  a  random  number  between
              zero and randstart seconds.
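
              The effect of randstart can be pictured with this Python
              sketch.  The function name initial_delays and the
              (group, service) keys are hypothetical, and a uniform
              distribution is an assumption; the text only says the value
              is random.

```python
import random


def initial_delays(services, randstart, rng=random):
    """Pick a random first-run delay in [0, randstart] for each service."""
    return {svc: rng.uniform(0, randstart) for svc in services}


delays = initial_delays([("servers", "smtp"), ("servers", "ping")], 600)
print(sorted(delays))
```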

       dep_recur_limit = depth
              Limit  dependency  recursion  level  to  depth.   If  dependency
              recursion (dependencies  which  depend  on  other  dependencies)
              tries to go beyond depth, then the recursion is aborted and a
              message is logged to syslog.  The default limit is 10.

       dep_behavior = {a|m|hm}
              dep_behavior  controls   whether   the   dependency   expression
              suppresses  one  of:  the  running  of  alerts,  the  running of
              monitors, or the passing of individual hosts  to  the  monitors.
              Read  more  about  the  behavior  in  the  "Service Definitions"
              section below.

              This is a global setting which controls the default settings for
              the service-specified variable.

       dep_memory = timeval
              If  set,  dep_memory  will  cause  dependencies  to  continue to
              prevent alerts/monitoring for a period of time after the service
              returns  to  a  normal state.  This can be used to prevent over-
              eager alerting when a machine is rebooting,  for  example.   See
              the explanation of interval in the "Service Definitions" section
              for a description of timeval.

              This is a global setting which controls the default settings for
              the service-specified variable.

       syslog_facility = facility
              Specifies  the  syslog facility used for logging.  daemon is the
              default.

       startupalerts_on_reset = {yes|no}

              If set to "yes", startupalerts will be invoked  when  the  reset
              client command is executed. The default is "no".

       monremote = program

              If set, this external program will be called by Mon when various
              client requests are processed.  This can be  used  to  propagate
              those  changes  from  one  Mon  server  to  another, if you have
              multiple monitoring machines.  An example  script,  monremote.pl
              is available in the clients directory.

   Hostgroup Entries
       Hostgroup entries begin with the keyword hostgroup, and are followed by
       a hostgroup tag and one or more hostnames or IP addresses, separated by
       whitespace.   The  hostgroup  tag  must  be  composed  of  alphanumeric
       characters, a dash ("-"), a period ("."), or an underscore ("_").  Non-
       blank  lines following the first hostgroup line are interpreted as more
       hostnames.  The hostgroup  definition  ends  with  a  blank  line.  For
       example:

              hostgroup servers nameserver smtpserver nntpserver
                   nfsserver httpserver smbserver

              hostgroup router_group cisco7000 agsplus
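
       The hostgroup syntax can be parsed as in the following Python
       sketch.  The function name parse_hostgroups is hypothetical, the
       input is assumed to have had continuation lines joined and comments
       removed, and lines outside a hostgroup definition are simply
       skipped.

```python
import re

TAG = re.compile(r"[A-Za-z0-9._-]+")   # alphanumerics, dash, period, underscore


def parse_hostgroups(text):
    """Return {tag: [hosts]} for every hostgroup definition in text."""
    groups = {}
    current = None
    for line in text.splitlines():
        words = line.split()
        if not words:
            current = None             # a blank line ends the definition
        elif words[0] == "hostgroup":
            if len(words) < 2 or not TAG.fullmatch(words[1]):
                raise ValueError(f"bad hostgroup line: {line!r}")
            current = groups.setdefault(words[1], [])
            current.extend(words[2:])
        elif current is not None:
            current.extend(words)      # following lines add more hostnames
    return groups


sample = """hostgroup servers nameserver smtpserver nntpserver
     nfsserver httpserver smbserver

hostgroup router_group cisco7000 agsplus
"""
print(parse_hostgroups(sample))
```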

   View Entries
       View  entries  begin  with the keyword view, and are followed by a view
       tag and the names of one or more hostgroups.   The  view  tag  must  be
       composed  of  alphanumeric characters, a dash ("-"), a period ("."), or
       an underscore ("_"). Non-blank lines following the first view line  are
       interpreted  as  more hostgroup names.  The view definition ends with a
       blank line. For example:

              view servers dns-servers web-servers file-servers
                   mail-servers

              view network-services routers switches vpn-servers

   Watch Group Entries
       Watch entries begin with a line that starts  with  the  keyword  watch,
       followed  by  whitespace  and  a single word which normally refers to a
       pre-defined hostgroup. If the  second  word  is  not  recognized  as  a
       hostgroup  tag,  a new hostgroup is created whose tag is that word, and
       that word is its only member.

       Watch entries consist of one or more service definitions.

       A watch group is terminated by a blank line, the end of the file, or by
       a subsequent definition, "watch", "hostgroup", or otherwise.

       There may be a special watch group entry called "default". If a default
       watch group is defined with a service entry named "default", then  this
       definition  will be used in handling traps received for an unrecognized
       watch and service.

   Service Definitions
       service servicename
              A service definition begins with the  keyword  service  followed
              by  a word which is the tag for this service.  This word must be
              unique among all services defined for the same watch group.

              The components of a service are an interval, monitor, and one or
              more time period definitions, as defined below.

              If  a  service name of "default" is defined within a watch group
              called  "default"  (see   above),   then   the   default/default
              definition will be used for handling unknown mon traps.

              The  following configuration parameters are valid only following
              a service definition:

       VARIABLE=value
              Environment variables may be defined  for  each  service,  which
              will  be  included  in  the  environment of monitors and alerts.
              Variables must be specified in all capital letters,  must  begin
              with  an alphabetical character or an underscore, and there must
              be no spaces to the left of the equal sign.

       interval timeval
              The keyword interval followed by  a  time  value  specifies  the
              frequency  that a monitor script will be triggered.  Time values
              are defined as "30s", "5m", "1h", or "1d", meaning 30 seconds, 5
              minutes,  1  hour,  or  1  day.  The  numeric  portion  may be a
              fraction, such as "1.5h" for an hour and a half. This format  of
              a time specification will be referred to as timeval.
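
              As an illustration of the timeval format, the following sketch
              converts  a  timeval  to  seconds.  It is not taken from mon;
              the helper name timeval_to_seconds  is  hypothetical,  and  it
              accepts only the four suffixes named above.

```python
import re

# Unit suffixes named in the text: seconds, minutes, hours, days.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timeval_to_seconds(text):
    """Convert a timeval such as "30s" or "1.5h" to a number of seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"not a timeval: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit]


print(timeval_to_seconds("30s"))   # 30.0
print(timeval_to_seconds("1.5h"))  # 5400.0
```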

       failure_interval timeval
              Adjusts  the  polling interval to timeval when the service check
              is failing. Resets the interval to the original when the service
              succeeds.

       traptimeout timeval
              This  keyword  takes  the  same  time  specification argument as
              interval, and makes the service expect a trap from  an  external
              source  at  least that often, else a failure will be registered.
              This is used for a heartbeat-style service.

       trapduration timeval
              If a trap is received, the status of the service  the  trap  was
              delivered  to  will normally remain constant. If trapduration is
              specified, the status of the service will remain  in  a  failure
              state for the duration specified by timeval, and then it will be
              reset to "success".

       randskew timeval
              Rather than schedule the monitor script to run at the  start  of
              each  interval,  randomly  adjust  the interval specified by the
              interval parameter by plus-or-minus randskew.  The skew value is
              specified  in  the  same way as the interval parameter ("30s",
              "5m", etc.).  For example, if interval is "1m" and randskew  is
              "5s", then mon will
              schedule  the  monitor script some time between every 55 seconds
              and 65 seconds.  The intent is to help distribute  the  load  on
              the  server  when  many  services  are  scheduled  at  the  same
              intervals.
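
              A minimal sketch of the skew calculation follows.  The uniform
              distribution is an assumption; the text above only states the
              plus-or-minus range.  The function  name  skewed_interval  is
              hypothetical.

```python
import random


def skewed_interval(interval_secs, randskew_secs, rng=random):
    """Return one scheduling delay within interval +/- randskew, in seconds."""
    # Assumption: the skew is drawn uniformly from [-randskew, +randskew].
    return interval_secs + rng.uniform(-randskew_secs, randskew_secs)


# interval 1m with randskew 5s: every delay falls between 55 and 65 seconds.
delays = [skewed_interval(60, 5) for _ in range(1000)]
```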

       monitor monitor-name [arg...]
              The keyword monitor followed by  a  script  name  and  arguments
              specifies  the monitor to run when the timer expires. Shell-like
              quoting conventions are followed when specifying  the  arguments
              to  send  to the monitor script.  The script is invoked from the
              directory given with the -s argument, and  all  following  words
              are  supplied  as  arguments to the monitor program, followed by
              the list of hosts in the group referred to by the current  watch
              group.   If  the monitor line ends with ";;" as a separate word,
              the host groups are not appended to the argument list  when  the
              program is invoked.
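
              The way the argument list is assembled can be sketched as shown
              below.  This is an illustration only: Python's shlex module
              stands in for the "shell-like" quoting rules, which may differ
              in detail from mon's, and  the  names  monitor_argv  and
              example.monitor are hypothetical.

```python
import shlex


def monitor_argv(monitor_line, hosts):
    """Build the argument vector from the text after the monitor keyword."""
    words = shlex.split(monitor_line)
    if words and words[-1] == ";;":
        # A trailing ";;" word suppresses appending the host list.
        return words[:-1]
    return words + list(hosts)


with_hosts = monitor_argv('example.monitor -t "10 20"', ["host1", "host2"])
without_hosts = monitor_argv('example.monitor -t "10 20" ;;', ["host1", "host2"])
```

              Here with_hosts ends with the two host names, while
              without_hosts contains only the script name and its own
              arguments.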

       allow_empty_group
              The  allow_empty_group option will allow a monitor to be invoked
              even when the hostgroup for  that  watch  is  empty  because  of
              disabled  hosts.  The  default  behavior  is  not  to invoke the
              monitor when all hosts in a hostgroup have been disabled.

       description descriptiontext
              The text following description is queried  by  client  programs,
              passed  to  alerts  and monitors via an environment variable. It
              should contain a brief description of the service, suitable  for
              inclusion in an email or on a web page.

       exclude_hosts host [host...]
              Any  hosts  listed after exclude_hosts will be excluded from the
              service check.

       exclude_period periodspec
              Do not run a scheduled monitor during  the  time  identified  by
              periodspec.

       depend dependexpression
              The  depend  keyword is used to specify a dependency expression,
              which evaluates to either true or false, in the  boolean  sense.
              Dependencies  are  actual  Perl  expressions,  and must obey all
              syntactical rules. The expressions are evaluated  in  their  own
              package space so as to not accidentally have some unwanted side-
              effect.   If  a  syntax  error  is  found  when  evaluating  the
              expression, it is logged via syslog.

              Before evaluation, the following substitutions on the expression
              occur: phrases which look like "group:service"  are  substituted
              with  the  value  of  the  current  operational  status  of that
              specified service. These  opstatus  substitutions  are  computed
              recursively, so if service A depends upon service B, and service
              B depends upon service C, then service A depends upon service C.
              Successful  operational  statuses  (which  evaluate  to "1") are
              "STAT_OK",     "STAT_COLDSTART",      "STAT_WARMSTART",      and
              "STAT_UNKNOWN".   The  word "SELF" (in all caps) can be used for
              the group (e.g. "SELF:service"), and is an abbreviation for  the
              current watch group.

              This  feature  can  be used to control alerts for services which
              are dependent on other services, e.g.  an  SMTP  test  which  is
              dependent upon the machine being ping-reachable.
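
              The substitution step can be sketched as follows.  This  only
              rewrites  the expression text; it does not evaluate the Perl
              expression and does not model the recursive computation.
              Replacing a failing status with "0" is an assumption (the text
              states only that successful statuses evaluate to "1"), and the
              function name and the "FAILING" placeholder are hypothetical.

```python
import re

# Statuses the text lists as successful.
OK_STATUSES = {"STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART", "STAT_UNKNOWN"}


def substitute_opstatus(expression, current_group, opstatus):
    """Replace each group:service phrase with "1" (ok) or "0" (assumed)."""
    def replace(match):
        group, service = match.group(1), match.group(2)
        if group == "SELF":
            # SELF abbreviates the current watch group.
            group = current_group
        status = opstatus[(group, service)]
        return "1" if status in OK_STATUSES else "0"

    return re.sub(r"([\w.-]+):([\w.-]+)", replace, expression)


statuses = {("servers", "ping"): "STAT_OK", ("routers", "ping"): "FAILING"}
rewritten = substitute_opstatus("SELF:ping && routers:ping", "servers", statuses)
print(rewritten)  # 1 && 0
```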

       dep_behavior {a|m|hm}
              The evaluation of the dependency graphs specified via the depend
              keyword  can  control  the  suppression  of  alert  or   monitor
              invocations,  or  the  suppression of individual hosts passed to
              the monitor.

              Alert suppression.  If this option  is  set  to  "a",  then  the
              dependency  expression  will  be evaluated after the monitor for
              the service exits or after a trap is received.   An  alert  will
              only  be  sent  if the evaluation succeeds, meaning that none of
              the nodes in the dependency graph indicate failure.

              Monitor suppression.  If it is set to "m", then  the  dependency
              expression  will be evaluated before the monitor for the service
              is about to run.  If the evaluation succeeds, then  the  monitor
              will  be  run.  Otherwise,  the  monitor will not be run and the
              status of the service will remain the same.

              Host suppression.  If it is set to "hm" then  Mon  will  extract
              the  list  of  "parent" services from the dependency expression.
              (In fact the expression can be just a list  of  services.)  Then
              when  the  monitor  for the service is about to be run, for each
              host in the current hostgroup Mon will  search  all  the  parent
              services  which  are currently failing and look for the hostname
              in the current summary output.  If the hostname is  found,  this
              host will be excluded from this run of the monitor.  This can be
              used to e.g. allow an SMTP test on a group of hosts to still  be
              run  even  when a single host is not ping-reachable.  If all the
              rest of the hosts are working fine, the service will be in an OK
              state,  but  if  another  host fails the SMTP test Mon can still
              alert about that host even  though  the  parent  dependency  was
              failing.  The dependency expression will not be used recursively
              in this case.
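
              A rough sketch of host suppression  follows.  It  assumes  the
              summary  output  of each failing parent service is a
              whitespace-separated list of host names, which is the common
              case but is not guaranteed; how mon matches host names within
              a summary is not specified here.  The function name is
              hypothetical.

```python
def hosts_to_monitor(hosts, failing_parent_summaries):
    """Drop hosts that appear in the summary of any failing parent service."""
    excluded = set()
    for summary in failing_parent_summaries:
        # Assumption: each summary is a whitespace-separated host list.
        excluded.update(summary.split())
    return [host for host in hosts if host not in excluded]


# mail2 is failing the parent ping service, so only mail1 and mail3 are tested.
remaining = hosts_to_monitor(["mail1", "mail2", "mail3"], ["mail2"])
```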

       alertdepend dependexpression

       monitordepend dependexpression

       hostdepend dependexpression
              These  keywords  allow  you  to  specify   multiple   dependency
              expressions  of  different  types.   Each one corresponds to the
              different dep_behavior settings  listed  above.   They  will  be
              evaluated  independently  in  the  different  contexts as listed
              above.  If depend is  present,  it  takes  precedence  over  the
              matching keyword, depending on the dep_behavior setting.

       dep_memory timeval
              If  set,  dep_memory  will  cause  dependencies  to  continue to
              prevent alerts/monitoring for a period of time after the service
              returns  to  a  normal state.  This can be used to prevent over-
              eager alerting when a machine is rebooting,  for  example.   See
              the explanation of interval in the "Service Definitions" section
              for a description of timeval.

       redistribute alert [arg...]
              A service may have one redistribute option, which is  a  special
              form  of  an  alert  definition.   This alert will be called on
              every service status  update,  even  sequential  success  status
              updates.   This  can  be  used  to  integrate  Mon  with another
              monitoring system, or to link together multiple Mon servers  via
              an  alert  script  that  generates  Mon  traps.   See the "ALERT
              PROGRAMS" section above for a list of the  parameters  mon  will
              pass automatically to alert programs.

       unack_summary
              Remove  the  "acknowledged"  state from a service if the summary
              component of the failure message changes.  In most common  usage
              the summary is the list of hosts that are failing, so additional
              hosts failing would remove an ack.

   Period Definitions
       Periods are used to define the conditions which should allow alerts  to
       be delivered.

       period [label:] periodspec
              A  period  groups one or more alarms and variables which control
              how often an alert happens when there is a failure.  The  period
              definition has two forms. The first takes an argument which is a
              period specification from Patrick  Ryan’s  Time::Period  Perl  5
              module. Refer to "perldoc Time::Period" for more information.

              The   second   form  requires  a  label  followed  by  a  period
              specification, as defined above. The label is a  tag  consisting
              of  an  alphabetic  character  or underscore followed by zero or
              more alphanumerics or underscores and ending with a colon.  This
              form  allows  multiple  periods with the same period definition.
              One use is to have a period definition which has  no  alertafter
              or  alertevery  parameters  for  a  particular  time period, and
              another for the same time period with a different set of  alerts
              that does contain those parameters.

              Period  definitions, in either the first or second form, must be
              unique within each service definition. For example, if you  need
              to  define two periods both for "wd {Sun-Sat}", then one or both
              of the period definitions must specify a label such  as  "period
              t1: wd {Sun-Sat}" and "period t2: wd {Sun-Sat}".
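
              The two forms can be told apart by the  label  syntax  described
              above.  The sketch below is one way to do so; it is not mon's
              parser, and the function name split_period is hypothetical.

```python
import re

# Label: a letter or underscore, then letters, digits, or underscores,
# ending with a colon and followed by the period specification.
_LABELED = re.compile(r"([A-Za-z_][A-Za-z0-9_]*):\s+(.*)")


def split_period(argument):
    """Split the text after the period keyword into (label, periodspec)."""
    text = argument.strip()
    match = _LABELED.fullmatch(text)
    if match:
        return match.group(1), match.group(2)
    # First form: no label, the whole argument is the period specification.
    return None, text


print(split_period("wd {Sun-Sat}"))      # (None, 'wd {Sun-Sat}')
print(split_period("t1: wd {Sun-Sat}"))  # ('t1', 'wd {Sun-Sat}')
```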

       alertevery timeval [observe_detail | strict]
              The  alertevery  keyword  (within a period definition) takes the
              same type of argument as the interval variable, and  limits  the
              number  of  times an alert is sent when the service continues to
              fail.  For example, if the interval is "1h", then the alerts  in
              the  period  section  will  be  triggered  only  once  every
              hour. If the alertevery keyword is omitted in a period entry, an
              alert  will  be  sent  out  every time a failure is detected. By
              default, if  the  summary  output  of  two  successive  failures
              changes,  then  the  alertevery  interval  is overridden, and an
              alert will be sent.  If the string "observe_detail" is the  last
              argument,  then both the summary and detail output lines will be
              considered when comparing the output of successive failures.  If
              the string "strict" is the last argument, then the output of the
              monitor or the state change of the service will have  no  effect
              on  when  alerts are sent. That is, "alertevery 24h strict" will
              send only one alert every 24  hours,  no  matter  what.   Please
              refer  to  the  ALERT  DECISION  LOGIC  section  for  a detailed
              explanation of how alerts are suppressed.

       alertafter num

       alertafter num timeval

       alertafter timeval
              The alertafter keyword  (within  a  period  section)  has  three
              forms:  only  with the "num" argument, or with the "num timeval"
              arguments, or only with the "timeval" argument.   In  the  first
              form,  an  alert  will  only  be invoked after "num" consecutive
              failures.

              In the  second  form,  the  arguments  are  a  positive  integer
              followed  by  an interval, as described by the interval variable
              above.  If these parameters are specified, then the  alerts  for
              that  period will only be called after that many failures happen
              within that interval. For example, if alertafter  is  given  the
              arguments  "3 30m",  then the alert will be called if 3 failures
              happen within 30 minutes.

              In the third form, the argument is an interval, as described  by
              the  interval  variable above.  Alerts for that period will only
              be called if the service has been in a failure  state  for  more
              than the length of time described by the interval, regardless of
              the number of failures noticed within that interval.
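
              The second form ("num timeval") can be sketched as a count of
              failures inside a sliding window.  This is an illustration
              under assumptions: the inclusive window boundaries and the
              function name should_alert are not taken from mon.

```python
def should_alert(failure_times, now, num, window_secs):
    """True when at least num failures fall within the last window_secs."""
    recent = [t for t in failure_times if now - window_secs <= t <= now]
    return len(recent) >= num


# alertafter 3 30m: three failures within 1800 seconds trigger the alert.
clustered = should_alert([0, 600, 1500], now=1500, num=3, window_secs=1800)
spread_out = should_alert([0, 600, 2400], now=2400, num=3, window_secs=1800)
```

              In this sketch clustered is true, while spread_out is false
              because the first failure has aged out of the 30-minute window.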

       numalerts num

              This variable tells the server to call no more than  num  alerts
              during  a  failure.  The  alert  counter is kept on a per-period
              basis, and is reset upon each success.

       no_comp_alerts

              If this option  is  specified,  then  upalerts  will  be  called
              whenever  the  service  state  changes  from failure to success,
              rather than only after a corresponding "down" alert.

       alert alert [arg...]
              A period may contain multiple alerts, which are  triggered  upon
              failure  of  the  service.  An alert is specified with the alert
              keyword, followed by an optional exit parameter,  and  arguments
              which  are  interpreted  the same as the monitor definition, but
              without the ";;" exception. The exit parameter takes the form of
              exit=x  or  exit=x-y  and  has the effect that the alert is only
              called if the exit status of the monitor script falls within the
              range  of the exit parameter. If, for example, the alert line is
              "alert exit=10-20 mail.alert mis", then mail.alert will only  be
              invoked  with mis as its arguments if the monitor program’s exit
              value is between 10 and 20. This feature allows you  to  trigger
              different  alerts  at  different severity levels (like when free
              disk space goes from 8% to 3%).

              See  the  ALERT  PROGRAMS  section  above  for  a  list  of  the
              parameters mon will pass automatically to alert programs.
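
              The exit parameter check can be sketched as below.  Treating
              both ends of the range as inclusive is an assumption based on
              the "between 10 and 20" wording; the function name
              exit_matches is hypothetical.

```python
def exit_matches(exit_param, status):
    """Check a monitor exit status against "exit=x" or "exit=x-y"."""
    spec = exit_param.split("=", 1)[1]
    low, _, high = spec.partition("-")
    # For the single-value form, the range collapses to one value.
    return int(low) <= status <= int(high or low)


print(exit_matches("exit=10-20", 15))  # True
print(exit_matches("exit=10-20", 21))  # False
print(exit_matches("exit=3", 3))       # True
```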

       upalert alert [arg...]
              An  upalert is the complement of an alert.  An upalert is called
              when a service  makes  the  state  transition  from  failure  to
              success,  if  a  corresponding "down" alert was previously sent.
              The upalert script is called supplying the  same  parameters  as
              the alert script, with the addition of the -u parameter which is
              simply used to let an alert script know that it is being  called
              as  an  upalert.  Multiple  upalerts  may  be specified for each
              period definition.  Set the per-period no_comp_alerts option  to
              send  an upalert regardless of whether or not a "down" alert was
              sent.

       startupalert alert [arg...]
              A startupalert  is  only  called  when  the  mon  server  starts
              execution,  or  when a "reset" command was issued to the server,
              depending on the setting of the  startupalerts_on_reset  global.
              Unlike  other alerts, startupalerts are not called following the
              exit of a monitor, i.e. they are  called  in  their  own  right,
              therefore   the   "exit="   argument   is   not   applicable  to
              startupalert.

       upalertafter timeval
              The upalertafter parameter is specified as a string that follows
              the  syntax  of  the interval parameter ("30s", "1m", etc.), and
              controls the triggering of an upalert.  If a service comes  back
              up  after  being  down  for  a time greater than or equal to the
              value of this option, an upalert will be called. Use this option
              to prevent upalerts from being called because of "blips"  (brief
              outages).

AUTHENTICATION CONFIGURATION FILE

       The file specified by the authfile variable in the  configuration  file
       (or  passed  via  the  -A parameter) will be loaded upon startup.  This
       file defines restrictions upon which client commands may be executed by
       which  users.  It  is  a  text file which consists of comments, command
       definitions, and trap authentication parameters.  A comment line begins
       with optional whitespace followed by a pound sign.  Blank  lines  are
       ignored.

       The file is separated into  a  command  section  and  a  trap  section.
       Sections are specified by a single line containing one of the following
       statements:

                   command section

       or

                   trap section

       Lines following one of the above statements apply to that section until
       either the end of the file or another section begins.

       A  command  definition  consists  of  a  command,  followed by a colon,
       followed by a  comma-separated  list  of  users  who  may  execute  the
       command.   The default is that no users may execute any commands unless
       they are explicitly allowed in this configuration file. For clarity,  a
       user  can  be  denied  by prefixing the user name with "!". If the word
       "AUTH_ANY" is used for a username, then any authenticated user will  be
       allowed  to  execute  the  command.  If  the  word  "all" is used for a
       username, then that command may be executed by any user,  authenticated
       or not.

       The  trap  section  allows  configuration of which users may send traps
       from which hosts. The syntax is a source host  (name  or  ip  address),
       whitespace,  a  username, whitespace, and a plaintext password for that
       user. If the source host is "*", then allow traps from any host. If the
       username  is  "*", then accept traps without regard for the username or
       password. If no hosts or users are specified, then  no  traps  will  be
       accepted.

       An example configuration file:

              command section
              list:          all
              reset:         root,admin
              loadstate:          root
              savestate:          root

              trap section
              127.0.0.1 root r@@tp4sswrd

       This  means  that  all  clients  are  able to perform the list command,
       "root" is able to perform "reset", "loadstate", and  "savestate",  and
       "admin" is able to execute only the "reset" command.
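
       The command-section rules can be sketched as follows.  This is not
       mon's implementation: the function names are hypothetical, the trap
       section is skipped rather than parsed, and letting a "!" denial take
       precedence over "all" and "AUTH_ANY" is an assumption.

```python
def parse_commands(text):
    """Return a dict mapping each command to its list of user entries."""
    commands, in_command_section = {}, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line in ("command section", "trap section"):
            in_command_section = line == "command section"
            continue
        if in_command_section:
            name, _, users = line.partition(":")
            commands[name.strip()] = [u.strip() for u in users.split(",") if u.strip()]
    return commands


def may_execute(allowed_users, user=None):
    """Decide access; user is None for an unauthenticated client."""
    if user is not None and "!" + user in allowed_users:
        return False  # assumption: an explicit denial wins
    if "all" in allowed_users:
        return True
    if user is None:
        return False
    return "AUTH_ANY" in allowed_users or user in allowed_users


EXAMPLE_AUTH = """
command section
list:          all
reset:         root,admin
loadstate:          root
savestate:          root

trap section
127.0.0.1 root r@@tp4sswrd
"""

commands = parse_commands(EXAMPLE_AUTH)
print(may_execute(commands["list"]))                # True
print(may_execute(commands["reset"], "admin"))      # True
print(may_execute(commands["loadstate"], "admin"))  # False
```

       A command that does not appear in the file has no user list, so
       may_execute(commands.get(name, []), user) returns false, matching
       the default-deny behavior described above.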

CLIENT-SERVER INTERFACE

       The  server listens on TCP port 2583, which may be overridden using the
       -p port option. Commands are  a  single  line  each,  terminated  by  a
       newline.   The  server  can  handle  any  number of simultaneous client
       connections.

CLIENT INTERFACE COMMANDS

       See manual page for moncmd.

MON TRAPPING

       Mon has the facility to receive special "mon traps" from any  local  or
       remote  machine.  Currently,  the only available method for sending mon
       traps is through the Mon::Client Perl interface, though the UDP  packet
       format  is  defined well enough to permit the writing of traps in other
       languages.

       Traps are handled similarly to monitors: a trap  sends  an  operational
       status,  summary line, and description text, and mon generates an alert
       or upalert as necessary.

       Traps can be caught by any  watch/service  group  set  up  in  the  mon
       configuration   file,  however  it  is  suggested  that  you  configure
       watch/service groups specifically for the traps you expect to  receive.
       When defining a special watch/service group for traps, do not include a
       "monitor" directive (as no monitor need be invoked). Since a monitor is
       not being invoked, it is not necessary for the watch definition to have
       a hostgroup which contains real host names.   Just  make  up  a  useful
       name, and mon will automatically create the watch group for you.

       Here is a simple config file example:

              watch trap-service
                   service host1-disks
                        description TRAP: for host1 disk status
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

       Since  mon  listens  on  a UDP port for any trap, a default facility is
       available for handling traps to unknown groups or services.  To  enable
       this  facility,  you  must  include  a  "default"  watch  group  with a
       "default" service entry containing  the  specifics  of  alarms.   If  a
       default/default  watch  group  and  service  are  not  configured, then
       unknown traps get logged via syslog, and no alarm is sent.   NOTE:  The
       default/default  facility  is  a single entity as far as accounting and
       alarming go. Alarm programs which are not aware of this fact  may  send
       confusing  information  when  a  failure  trap  comes from one machine,
       followed by a success (ok) trap from a different machine. See the alarm
       environment  variable MON_TRAP_INTENDED above for a possible way around
       this. It is intended that default/default be  used  as  a  facility  to
       catch  unknown  traps, and should not be relied upon to catch all traps
       in a production environment. If you are  lazy  and  only  want  to  use
       default/default  for  catching  all  traps, it would be best to disable
       upalerts, and use the MON_TRAP_INTENDED environment variable  in  alert
       scripts to make the alerts more meaningful to you.

       Here is an example default facility:

              watch default
                   service default
                        description Default trap service
                        period wd {Sun-Sat}
                             alert mail.alert someone@your.org
                             upalert mail.alert -u someone@your.org

EXAMPLES

       The  mon  distribution  comes  with  an  example  configuration  called
       example.cf.  Refer to that file for more information.

HISTORY

       mon was written because I couldn’t find anything  out  there  that  did
       just what I needed, and nothing was worth modifying to add the features
       I wanted. It doesn’t have a cool name, and that bothers  me  because  I
       couldn’t think of one.

BUGS

       Report bugs to the email address below.

AUTHOR

       Jim Trocki <trockij@arctic.org>
