Before you begin
You must know which exit values your pre-execution script returns to indicate
failure.
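One way to find these values is to run the pre-execution command by hand on a test host and record its exit status. The following is a minimal sketch only; report_exit_value is a hypothetical helper name, not an LSF command, and it assumes the command can be run safely outside of LSF.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): run a candidate pre-execution
# command locally, discard its output, and print the exit value it returned.
report_exit_value() {
    "$@" >/dev/null 2>&1
    status=$?
    echo "$status"
    return 0
}
```

For example, report_exit_value sh -c 'exit 10' prints 10, which is the value you would then consider listing in lsb.params.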
About this task
Any non-zero exit code from a pre-execution script indicates a failure. For jobs that are
designated as re-runnable on failure, LSF checks the exit value of the failed pre-execution
script to determine whether to exclude the host where the script failed. An excluded host
is no longer a candidate to run the job.
Procedure
- Create a pre-execution script that exits with a specific value if it is
unsuccessful.
Example:
#!/bin/sh
# Usually, when pre_exec fails for a host-specific reason, such as
# /tmp being full, the script should exit immediately so that LSF
# re-dispatches the job to a different host.
# For example:
# define PREEXEC_EXCLUDE_HOST_EXIT_VALUES = 10 in lsb.params
# exit 10 when pre_exec detects that /tmp is full.
# LSF then re-dispatches this job to a different host.
DISC=/tmp
# Column 5 of "df -P" is the used capacity (for example "98%"), so the
# free percentage is 100 minus that value. Passing $DISC to df reports
# the file system that contains /tmp.
USED=`df -P $DISC | awk 'NR==2 {sub(/%/, "", $5); print $5}'`
if [ "${USED}" != "" ]
then
    FREE=$((100 - USED))
    echo "$FREE"
    if [ "${FREE}" -le "2" ]   # When only 2% or less of the space for
                               # /tmp is available on this host, let LSF
                               # re-dispatch the job to a different host.
    then
        exit 10
    fi
fi
# Sometimes pre_exec fails because the NFS server is busy. Retrying the
# operation several times inside this script can succeed, and it affects
# LSF performance less than re-dispatching the job.
RETRY=10
while [ $RETRY -gt 0 ]
do
    # Replace the commented-out command with the operation to retry.
    #mount host_name:/export/home/bill /home/bill
    EXIT=$?
    if [ $EXIT -eq 0 ]
    then
        RETRY=0
    else
        RETRY=`expr $RETRY - 1`
        if [ $RETRY -eq 0 ]
        then
            exit 99   # We have tried 10 times.
                      # Something is wrong with the NFS server. Fail the
                      # job, fix the NFS problem, and then submit the
                      # job again.
        fi
    fi
done
- In lsb.params, use PREEXEC_EXCLUDE_HOST_EXIT_VALUES to set the exit values that indicate that the
pre-execution script failed on the host.
Values from 1 to 255 are allowed, except 99, which is reserved. Separate multiple values with spaces.
For the example script above, set PREEXEC_EXCLUDE_HOST_EXIT_VALUES=10.
- (Optional) Define MAX_PREEXEC_RETRY to limit the total
number of times LSF retries the pre-execution script on hosts.
- Run badmin reconfig.
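Before editing lsb.params, the value rules from the procedure above (integers from 1 to 255, excluding the reserved value 99, separated by spaces) can be checked with a small script. This is a sketch only; check_exclude_values is a hypothetical helper name, not an LSF command, and it checks only the rules stated in this topic.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): check a space-separated list of
# candidate exit values. Prints "ok" and returns 0 if every value is an
# integer from 1 to 255 other than the reserved value 99. Otherwise it
# prints the first invalid value and returns 1.
check_exclude_values() {
    for value in $1
    do
        # Reject anything that is not a plain non-negative integer.
        case "$value" in
            ''|*[!0-9]*)
                echo "invalid: $value"
                return 1
                ;;
        esac
        if [ "$value" -lt 1 ] || [ "$value" -gt 255 ] || [ "$value" -eq 99 ]
        then
            echo "invalid: $value"
            return 1
        fi
    done
    echo "ok"
    return 0
}
```

For example, check_exclude_values "10 20" prints ok, while check_exclude_values "10 99" prints invalid: 99 because 99 is reserved.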
Results
If a pre-execution script exits with value 10 (as in the example above), LSF adds the host
where the script ran to the job's exclusion list and attempts to reschedule the job on another host.
Hosts remain in a job's exclusion list for a period of time specified in the
LSB_EXCLUDE_HOST_PERIOD parameter in lsf.conf, or until
mbatchd restarts.
In the multicluster job lease model, LSB_EXCLUDE_HOST_PERIOD does
not apply, so hosts remain in a job's exclusion list until mbatchd restarts.
What to do next
To view the hosts on a job's exclusion list, run bjobs -lp.