
Cleaning up Platform LSF parallel job execution problems

The LSF distributed application integration framework, blaunch, offers full control of parallel jobs running across multiple execution hosts. The framework can perform job cleanup after errors such as the following:

  • Node crash
  • Node hang
  • Network connection problems
  • Unexpected network latency problems
  • Abnormal task exit

This cleanup work is in addition to the cleanup functionality provided by other parallel execution products such as the IBM Parallel Environment (IBM PE).

Job cleanup refers to the following:

  • Cleaning up all leftover processes on all execution nodes
  • Performing post-job cleanup operations on all execution nodes, such as cleaning up cgroups, cleaning up Kerberos credentials, and resetting CPU frequencies
  • Removing the job from LSF and marking its status as Exit

The default LSF behavior is designed to handle the most common recovery cases for these scenarios. LSF also offers a set of parameters that let you tune its behavior for each scenario, in particular how quickly LSF detects each failure and what action it takes in response.

There are typically three scenarios requiring job cleanup:

  • First execution host crashing or hanging
  • Non-first execution host crashing or hanging
  • Parallel job tasks exiting abnormally

This article describes how to configure LSF to handle these scenarios.

Parallel job first execution host crashing or hanging

REMOVE_HUNG_JOBS_FOR (lsb.params)

When the first execution host crashes or hangs, by default LSF marks a running job as UNKNOWN. LSF does not clean up the job until the host comes back and the LSF master host confirms that the job is really gone from the system. This default behavior is not always desirable, since hung jobs hold their resource allocation in the meantime. Define REMOVE_HUNG_JOBS_FOR in lsb.params to change the default behavior and remove hung jobs from the system automatically.

REMOVE_HUNG_JOBS_FOR = all[,wait_time=minutes] | condition1[,wait_time=minutes][:condition2[,wait_time=minutes]]...

By default, REMOVE_HUNG_JOBS_FOR is not defined. The condition keywords supported in LSF 9.1 are runlimit and host_unavail. The condition names are not case sensitive. The default wait_time is 10 minutes.

The following condition combinations are supported:

  • all[,wait_time=10]
  • runlimit[,wait_time=5]:host_unavail[,wait_time=10]
  • host_unavail[,wait_time=10]:runlimit[,wait_time=5]
  • runlimit[,wait_time=5]
  • host_unavail[,wait_time=10]

After setting REMOVE_HUNG_JOBS_FOR, run badmin reconfig as LSF administrator to make the parameter take effect.

For example, you can define:

REMOVE_HUNG_JOBS_FOR = runlimit:host_unavail

With this setting, LSF removes jobs that run 10 minutes past the job run limit, or that remain UNKNOWN for 10 minutes because the first execution host became unavailable.
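To use a different grace period for each condition, add wait_time to the condition. A sketch with illustrative values (the wait_time unit is minutes):

REMOVE_HUNG_JOBS_FOR = runlimit,wait_time=5:host_unavail,wait_time=30

With these values, LSF would remove jobs 5 minutes past their run limit, but wait 30 minutes before removing jobs whose first execution host became unavailable. As above, run badmin reconfig for the change to take effect.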

DJOB_HB_INTERVAL, DJOB_RU_INTERVAL (lsb.applications) and LSF_RES_ALIVE_TIMEOUT (lsf.conf)

When the first execution host crashes or hangs, the LSF components on the non-first execution hosts can detect the error and clean up the job processes on their own hosts. Detection is based on a permanent network connection and heartbeat messages between the first execution host and the non-first execution hosts. When the network connection is broken, LSF on each non-first execution host cleans up the job processes on that host. The following parameters control the heartbeat and timeout:

  • LSB_DJOB_HB_INTERVAL
  • LSB_DJOB_RU_INTERVAL
  • LSF_RES_ALIVE_TIMEOUT

LSB_DJOB_HB_INTERVAL and LSB_DJOB_RU_INTERVAL control how frequently the LSF components on non-first execution hosts send heartbeat and resource usage messages to the LSF component on the first execution host. When the first execution host receives a resource usage or heartbeat message, it replies with an acknowledgement message to the non-first execution host.

The default value of LSB_DJOB_HB_INTERVAL is 120 seconds per 1000 nodes. Define one of the following to override this value:

  • DJOB_HB_INTERVAL in an application profile in lsb.applications. Use bsub -app to associate the job with the application profile.
  • LSB_DJOB_HB_INTERVAL as a job environment variable. The environment variable must be set before job submission, not in an LSF job script, and overrides the application profile configuration.

The default value of LSB_DJOB_RU_INTERVAL is 300 seconds per 1000 nodes. Define one of the following to override this value:

  • DJOB_RU_INTERVAL in an application profile in lsb.applications. Use bsub -app to associate the job with the application profile.
  • LSB_DJOB_RU_INTERVAL as a job environment variable. The environment variable must be set before job submission, not in an LSF job script, and overrides the application profile configuration.

For large, long-running parallel jobs, LSB_DJOB_RU_INTERVAL can be set to a long interval, or disabled entirely with a value of 0, to avoid overly frequent resource usage updates; sending and processing a large volume of resource usage information consumes both network bandwidth and CPU time. LSB_DJOB_HB_INTERVAL cannot be disabled.

After setting DJOB_HB_INTERVAL and DJOB_RU_INTERVAL in an application profile, run badmin reconfig to reconfigure mbatchd and make the change take effect. The changed values apply only to new jobs; they have no impact on jobs that are already running.
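For example, a minimal application profile sketch in lsb.applications; the profile name and interval values are illustrative, not recommendations:

Begin Application
NAME              = bigParallel
DJOB_HB_INTERVAL  = 60
DJOB_RU_INTERVAL  = 0
DESCRIPTION       = Heartbeat tuning for large parallel jobs
End Application

Here the heartbeat is sent every 60 seconds and resource usage updates are disabled with the 0 value, as described above. Submit jobs with bsub -app bigParallel to pick up these values.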

LSF_RES_ALIVE_TIMEOUT controls how many seconds an LSF component on a non-first execution host waits for a reply to a heartbeat or resource usage message from the first execution host. After the timeout, LSF treats the first execution host as no longer alive and takes action to clean up.

The default value of LSF_RES_ALIVE_TIMEOUT is 60 seconds. LSF_RES_ALIVE_TIMEOUT set in the job environment overrides the value set in lsf.conf.

NOTE: The environment variables mentioned above must be set BEFORE bsub runs to take effect.
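For example, a sketch of overriding all three values for a single job from the submission shell; the values and the job command are illustrative:

$ export LSB_DJOB_HB_INTERVAL=60
$ export LSB_DJOB_RU_INTERVAL=0
$ export LSF_RES_ALIVE_TIMEOUT=120
$ bsub -n 512 mpirun ./myapp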

Parallel job non-first execution host crashing or hanging

LSB_FANOUT_TIMEOUT_PER_LAYER (lsf.conf)

Before a parallel job executes, LSF must perform setup work on each job execution host and propagate job information to all of these hosts. LSF provides a communication fan-out framework to handle this. If an execution host fails, a timeout value controls how quickly the framework treats the communication as failed and rolls back the job dispatch decision. By default, the timeout value is 20 seconds for each communication layer. Define LSB_FANOUT_TIMEOUT_PER_LAYER in lsf.conf to customize the timeout value.

Run badmin hrestart all as the LSF administrator to restart sbatchd on all hosts and make the parameter take effect on all nodes.

LSB_FANOUT_TIMEOUT_PER_LAYER can also be defined in the job environment before job submission to override the value specified in lsf.conf.

You can set a larger value for large jobs (for example, 60 seconds for jobs spanning more than 1,000 nodes).
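For example, a sketch of raising the per-layer timeout cluster-wide to the 60-second value suggested above, then restarting sbatchd everywhere:

# lsf.conf
LSB_FANOUT_TIMEOUT_PER_LAYER=60

$ badmin hrestart all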

One indicator that this parameter needs tuning is bhist -l output showing jobs bouncing back and forth between starting and pending due to timeout errors. The timeout errors are logged in the sbatchd log.

For example:

$ bhist -l 100
Job <100>, User <user1>, Project <default>, Command <mpirun ./cpi>
Mon Oct 21 19:20:43: Submitted from host <hostA>, to Queue <normal>, CW
                     D </home/user1/>, 320 Processors Requested, Reque
                     sted Resources <span[ptile=8]>;
Mon Oct 21 19:20:43: Dispatched to 40 Hosts/Processors <hostA> <hostB> <
……
Mon Oct 21 19:20:43: Starting (Pid 19137);
Mon Oct 21 19:21:06: Pending: Failed to send fan-out information to other SBDs;

LSF_DJOB_TASK_REG_WAIT_TIME (lsf.conf)

When a parallel job is started, the LSF component on the first execution host needs to receive a registration message from the components on the non-first execution hosts. By default, LSF waits 300 seconds for these registration messages; after that, LSF starts to clean up the job.

Use LSF_DJOB_TASK_REG_WAIT_TIME to customize this waiting period. The parameter can be defined in lsf.conf or in the job environment at job submission. The value in lsf.conf applies to all jobs in the cluster, while the job environment variable controls the behavior of that particular job only and overrides the value in lsf.conf. The unit is seconds. Set a larger value for large jobs (for example, 600 seconds for jobs spanning 5,000 nodes).

Run lsadmin resrestart all as the LSF administrator to restart RES on all nodes and make the change take effect.
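For example, a sketch of setting the cluster-wide value to the 600 seconds suggested above, then restarting RES:

# lsf.conf
LSF_DJOB_TASK_REG_WAIT_TIME=600

$ lsadmin resrestart all

Alternatively, set it in the environment for a single job; the job command here is illustrative:

$ export LSF_DJOB_TASK_REG_WAIT_TIME=600
$ bsub -n 4096 mpirun ./myapp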

You should set this parameter if you see an INFO-level message like the following in res.log.<first_execution_host>:

$ grep "waiting for all tasks to register" res.log.hostA

Oct 20 20:20:29 2013 7866 6 9.1.1 doHouseKeeping4ParallelJobs: job 101 timed out <20> waiting for all tasks to register, registered <315> out of <320>

DJOB_COMMFAIL_ACTION (lsb.applications)

After a job is successfully launched and all tasks have registered themselves, LSF keeps monitoring the connections from the first execution host to the rest of the execution hosts. If a connection failure is detected, by default LSF begins to shut down the job. Configure DJOB_COMMFAIL_ACTION in an application profile in lsb.applications to customize the behavior. The parameter syntax is:

DJOB_COMMFAIL_ACTION="KILL_TASKS|IGNORE_COMMFAIL"

IGNORE_COMMFAIL: LSF allows the job to continue running. Communication failures between the first execution host and the rest of the execution hosts are ignored.

KILL_TASKS: LSF tries to kill all current tasks of the parallel or distributed job associated with the communication failure.

By default, DJOB_COMMFAIL_ACTION is not defined: LSF terminates all tasks and shuts down the entire job.

You can also set the environment variable LSB_DJOB_COMMFAIL_ACTION before submitting the job to override the value set in the application profile.
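For example, a sketch of letting a single job survive connection failures; the job command is illustrative:

$ export LSB_DJOB_COMMFAIL_ACTION=IGNORE_COMMFAIL
$ bsub -n 64 mpirun ./myapp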

Parallel job abnormal task exit

RTASK_GONE_ACTION (lsb.applications)

If some tasks exit abnormally during parallel job execution, LSF takes action to terminate and clean up the entire job. This behavior can be customized with RTASK_GONE_ACTION in an application profile in lsb.applications or with the LSB_DJOB_RTASK_GONE_ACTION environment variable in the job environment.

The LSB_DJOB_RTASK_GONE_ACTION environment variable overrides the setting of RTASK_GONE_ACTION in lsb.applications.

The following values are supported:

[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]

KILLJOB_TASKDONE: LSF terminates all tasks in the job when one remote task exits with a zero value.

KILLJOB_TASKEXIT: LSF terminates all tasks in the job when one remote task exits with a non-zero value.

IGNORE_TASKCRASH: LSF does nothing when a remote task crashes. The job continues to run to completion.

By default, RTASK_GONE_ACTION is not defined, so LSF terminates all tasks and shuts down the entire job when one task crashes.

For example:

  1. Define an application profile in lsb.applications:
Begin Application
NAME         = myApp
DJOB_COMMFAIL_ACTION=IGNORE_COMMFAIL
RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"
DESCRIPTION  = Application profile example
End Application
  2. Run badmin reconfig as LSF administrator to make the configuration take effect.
  3. Submit an MPICH2 job with -app myApp:
$ bsub -app myApp -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi
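
To override the profile for a single job, a sketch using the environment variable instead (the same myApp profile is assumed):

$ export LSB_DJOB_RTASK_GONE_ACTION="IGNORE_TASKCRASH"
$ bsub -app myApp -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi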