Configure the application profile to enable the watchdog feature and specify the LSF Application Center
Notifications server to receive notifications.
Before you begin
To ensure that the watchdog scripts can send notifications to the LSF Application Center
Notifications server, define the LSF_AC_PNC_URL parameter in the
lsf.conf file.
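For illustration, the entry in lsf.conf might look like the following fragment. The host name and port shown here are hypothetical placeholders, not values from this document; use the address on which your Notifications server actually listens.

```shell
# lsf.conf (fragment)
# Hypothetical host and port for the LSF Application Center Notifications server.
LSF_AC_PNC_URL=http://ac-notify.example.com:8081
```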
Procedure
- Create a watchdog script to monitor the application (by checking application data, logs, and
other information) and send notification messages.
In the script, use the bpost -N command option to send a notification to the LSF Application Center
Notifications server. Specify the message text in the -d option and the severity level as the
argument to -N:
bpost -d "message" -N WARNING | ERROR | CRITICAL | INFO
All job environment variables are available to the watchdog scripts. In addition, the following
LSF job-level resource consumption environment variables are available to the watchdog
scripts:
- LSB_GPU_ALLOC_INFO
- LSB_JOB_AVG_MEM
- LSB_JOB_CPU_TIME
- LSB_JOB_MAX_MEM
- LSB_JOB_MEM
- LSB_JOB_NTHREAD
- LSB_JOB_PGIDS
- LSB_JOB_PIDS
- LSB_JOB_RUN_TIME
- LSB_JOB_SWAP
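As a rough illustration of how a watchdog script might use these variables, the sketch below compares the job's current memory use with a limit and picks a notification level. The function name pick_mem_level and the 80% warning threshold are hypothetical. The sketch also assumes that LSB_JOB_MEM holds a plain integer in the same unit as the limit that you pass in; confirm the unit for your LSF version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: choose a notification level from memory use and a limit.
# Assumes both arguments are non-negative integers in the same unit.
pick_mem_level() {
    mem=$1
    limit=$2
    warn_at=$((limit * 80 / 100))   # assumed warning threshold: 80% of the limit
    if [ "$mem" -ge "$limit" ]; then
        echo CRITICAL
    elif [ "$mem" -ge "$warn_at" ]; then
        echo WARNING
    else
        echo OK
    fi
}
```

A watchdog script could call pick_mem_level "${LSB_JOB_MEM:-0}" <limit> and run bpost -N with the returned level only when the result is not OK.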
The watchdog script might have the following format:
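#!/bin/sh
# Load the LSF environment so that bpost is available.
. <lsf_conf_dir>/profile.lsf

<application_checking_commands>

if <okay>; then
    exit 0
elif <warning_level>; then
    # Warning condition: send a notification and exit with status 0.
    bpost -N WARNING -d "WARNING: <warning_message>"
    exit 0
else
    # Critical condition: send a notification and exit with status 1.
    bpost -N CRITICAL -d "FATAL: <critical_message>"
    exit 1
fi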
Note: You must add a command to source the LSF environment at the beginning of the watchdog script.
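To make the format above more concrete, the following sketch fills in the checking step with a simple log scan. It is one possible shape under stated assumptions, not a prescribed implementation: the function names check_log and run_watchdog, the FATAL and WARN log patterns, and the message texts are all hypothetical. The bpost calls follow the form shown in the template.

```shell
#!/bin/sh
# Hypothetical check: classify an application log by the worst line found.
# The FATAL and WARN patterns are assumptions about the log format.
check_log() {
    log=$1
    if [ ! -r "$log" ]; then
        echo CRITICAL            # log is missing or unreadable
    elif grep -q 'FATAL' "$log"; then
        echo CRITICAL
    elif grep -q 'WARN' "$log"; then
        echo WARNING
    else
        echo OK
    fi
}

# Map the check result to a notification and a return status,
# mirroring the exit codes used in the template.
run_watchdog() {
    case $(check_log "$1") in
        OK)      return 0 ;;
        WARNING) bpost -N WARNING -d "WARNING: warnings found in $1"
                 return 0 ;;
        *)       bpost -N CRITICAL -d "FATAL: log check failed for $1"
                 return 1 ;;
    esac
}
```

A complete script would first load <lsf_conf_dir>/profile.lsf, as the note requires, and then finish with a call such as run_watchdog <path_to_application_log> followed by exit $?.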
- Set the proper permissions for the script to ensure that the job submission user is able to
execute the script.
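One simple way to do this, assuming that it is acceptable for every user on the host to read and run the script, is sketched below. The helper name make_runnable is hypothetical, and sites with stricter policies might grant access through a group instead.

```shell
#!/bin/sh
# Hypothetical helper: grant read and execute permission to all users,
# then confirm that the current user can execute the file.
make_runnable() {
    chmod a+rx "$1" && [ -x "$1" ]
}
```

For example, make_runnable <path_to_watchdog_script> returns 0 on success. Each directory in the path must also be searchable by the job submission user for the script to be reachable.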