Set host exclusion based on job-based pre-execution scripts

Before you begin

You must know which exit values your pre-execution script uses to indicate failure.

About this task

Any non-zero exit code from a pre-execution script indicates a failure. For jobs that are designated as re-runnable on failure, LSF checks the exit value of the failed pre-execution script to determine whether to exclude the host where the script failed. An excluded host is no longer a candidate to run the job.
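
The decision described above can be sketched as a small shell function. This is only an illustration, not LSF code: the function name classify_preexec_exit and the outcome labels are hypothetical, and what LSF does with non-zero values that are not in the exclusion list is outside the scope of this sketch.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): prints the outcome this
# topic describes for a pre-execution exit value.
#   $1 = exit value of the pre-execution script
#   $2 = space-separated list of values, in the same form as
#        PREEXEC_EXCLUDE_HOST_EXIT_VALUES
classify_preexec_exit() {
    code=$1
    exclude_values=$2

    # Zero is the only exit value that means success.
    if [ "$code" -eq 0 ]; then
        echo "success"
        return 0
    fi

    # A non-zero value that is in the list excludes the host.
    for value in $exclude_values; do
        if [ "$code" -eq "$value" ]; then
            echo "exclude_host"
            return 0
        fi
    done

    # Any other non-zero value is a failure without host exclusion.
    echo "failure"
}
```

For example, classify_preexec_exit 10 "10 20" prints exclude_host, and classify_preexec_exit 3 "10 20" prints failure.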

Procedure

  1. Create a pre-execution script that exits with a specific value if it is unsuccessful.
    Example:
    #!/bin/sh
    
    # When pre_exec fails for a host-specific reason, such as
    # /tmp being full, exit immediately so that LSF can
    # re-dispatch the job to a different host.
    # For example:
    #     define PREEXEC_EXCLUDE_HOST_EXIT_VALUES = 10 in lsb.params
    #     exit 10 when pre_exec detects that /tmp is full.
    # LSF then re-dispatches the job to a different host.
    DISC=/tmp
    # Column 5 of "df -P" is the used capacity (for example, 98%),
    # so the free percentage is 100 minus that value.
    USED=`df -P $DISC | awk 'NR==2 {print $5}' | awk -F% '{print $1}'`

    if [ "${USED}" != "" ]
    then
        FREE=`expr 100 - $USED`
        echo "$FREE"
        # When only 2% or less of /tmp is available on this host,
        # let LSF re-dispatch the job to a different host.
        if [ "${FREE}" -le 2 ]
        then
            exit 10
        fi
    fi
    
    # When pre_exec fails because the NFS server is busy, it can
    # succeed on a later attempt. Retrying inside this script has
    # less effect on LSF performance than having LSF retry the job.
    RETRY=10
    while [ $RETRY -gt 0 ]
    do
        # Uncomment and adapt the following command for your site:
        #mount host_name:/export/home/bill /home/bill
        EXIT=$?
        if [ $EXIT -eq 0 ]
        then
            RETRY=0
        else
            RETRY=`expr $RETRY - 1`
            if [ $RETRY -eq 0 ]
            then
                exit 99 # We have tried 10 times.
                        # Something is wrong with the NFS server. Fail
                        # the job, fix the NFS problem, and then
                        # submit the job again.
            fi
        fi
    done
  2. In lsb.params, use PREEXEC_EXCLUDE_HOST_EXIT_VALUES to set the exit values that indicate the pre-execution script failed to run.

    Values from 1 to 255 are allowed, except 99 (a reserved value). Separate multiple values with spaces.

    For the example script above, set PREEXEC_EXCLUDE_HOST_EXIT_VALUES=10.

  3. (Optional) Define MAX_PREEXEC_RETRY to limit the total number of times LSF retries the pre-execution script on hosts.
  4. Run badmin reconfig.
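
Before editing lsb.params, a candidate value list can be checked against the rules in step 2. The following is a minimal sketch, not part of LSF: the function name validate_exit_values and the sample list "10 20" are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): checks a candidate list
# against the rules in step 2: integers from 1 to 255, excluding 99,
# separated by spaces. Returns 0 if the list is acceptable, else 1.
validate_exit_values() {
    seen=0
    for value in $1; do
        seen=1
        case $value in
            *[!0-9]*) return 1 ;;      # contains a non-digit character
        esac
        [ "$value" -ge 1 ] && [ "$value" -le 255 ] || return 1
        [ "$value" -ne 99 ] || return 1    # 99 is reserved
    done
    # An empty list is not acceptable.
    [ "$seen" -eq 1 ]
}

# Usage: print the lsb.params line only if the list is valid.
values="10 20"
if validate_exit_values "$values"; then
    echo "PREEXEC_EXCLUDE_HOST_EXIT_VALUES=$values"
fi
```

A list such as "10 99" or "0 256" is rejected, so nothing is printed for it.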

Results

If a pre-execution script exits with value 10 (according to the example above), LSF adds this host to an exclusion list and attempts to reschedule the job on another host.

Hosts remain in a job's exclusion list for a period of time specified in the LSB_EXCLUDE_HOST_PERIOD parameter in lsf.conf, or until mbatchd restarts.

In the multicluster job lease model, LSB_EXCLUDE_HOST_PERIOD does not apply, so hosts remain in a job's exclusion list until mbatchd restarts.

What to do next

To view a list of hosts on a job's host exclusion list, run bjobs -lp.