Re-sizable job behavior

To optimize resource utilization, LSF allows the job allocation to shrink or grow during the job run time.

Use re-sizable jobs for long-tailed jobs, which are jobs that use many resources for a period, but use fewer resources toward the end of the job. Also use re-sizable jobs for jobs in which tasks are easily parallelizable, where each step or task can be made to run on a separate processor to achieve a faster result. The more resources such a job gets, the faster it can run. Session Scheduler jobs are good candidates.

Without re-sizable jobs, a job’s task allocation is static from the time the job is dispatched until it finishes. For long-tailed jobs, resources are wasted toward the end of the job, even if you use reservation and backfill, because estimated run times can be inaccurate. Parallel jobs run slower than they could run if more tasks were assigned to them. With re-sizable jobs, LSF can remove tasks from long-tailed jobs when the tasks are no longer needed, or add extra tasks to parallel jobs when needed during the job’s run time.

Automatic or manual resizing

An automatically re-sizable job is a re-sizable job with a minimum and maximum task request, where LSF automatically schedules and allocates more resources to satisfy the job maximum request as the job runs. Specify an automatically re-sizable job at job submission time by using the bsub -ar option.

For automatically re-sizable jobs, LSF automatically recalculates the pending allocation request each time it allocates more tasks to the running job. For instance, if a job requests a minimum of 4 and a maximum of 32 tasks, and LSF initially allocates 20 tasks to the job, its active pending allocation request is for another 12 tasks. After LSF assigns another four tasks, the pending allocation request is for eight tasks.
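The arithmetic in this example can be sketched as follows. This is only an illustration of how the pending request shrinks as tasks are added, not LSF's actual scheduler logic. The function name pending_tasks and its parameters are hypothetical.

```python
def pending_tasks(min_tasks: int, max_tasks: int, allocated: int) -> int:
    """Return how many tasks are still outstanding for the job.

    Hypothetical helper: the pending allocation request is taken to be
    the job's maximum task request minus the tasks already allocated.
    """
    if not min_tasks <= allocated <= max_tasks:
        raise ValueError("allocation must lie between the minimum and maximum")
    return max_tasks - allocated


# The job requests a minimum of 4 and a maximum of 32 tasks.
print(pending_tasks(4, 32, 20))  # initial allocation of 20 tasks -> 12
print(pending_tasks(4, 32, 24))  # after four more tasks are assigned -> 8
```

When the allocation reaches the maximum, the function returns 0, which corresponds to a job that has nothing further to request.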

You can also manually shrink or grow a running job by using the bresize command. Shrink a job by releasing tasks from the specified hosts with the bresize release subcommand. Grow a job by requesting more tasks with the bresize request subcommand.

Pending allocation request

A pending allocation request is an extra resource request that is attached to a re-sizable job. Running jobs are the only jobs that can have pending allocation requests. At any time, a job has only one allocation request.

LSF creates a new pending allocation request and schedules it after a job physically starts on the remote host (after LSF receives the JOB_EXECUTE event from the sbatchd daemon) or after a resize notification command completes successfully.

Resize notification command

A resize notification command is an executable that is invoked on the first execution host of a job in response to an allocation (grow or shrink) event. It can be used to inform the running application of the allocation change. Because applications are implemented in many different ways, each re-sizable application might have its own notification command that is provided by the application developer.

The notification command runs under the same user ID, environment, home directory, and working directory as the actual job. The standard input, output, and error of the program are redirected to the NULL device. If the notification command is not in the user's normal execution path (the $PATH variable), the full path name of the command must be specified.

A notification command exits with one of the following values:

LSB_RESIZE_NOTIFY_OK

LSB_RESIZE_NOTIFY_FAIL

LSF sets these values as environment variables in the notification command environment. The LSB_RESIZE_NOTIFY_OK value indicates that the notification succeeded. For both allocation grow and shrink events, LSF then updates the job allocation to reflect the new allocation.

The LSB_RESIZE_NOTIFY_FAIL value indicates that the notification failed. For an allocation grow event, LSF reschedules the pending allocation request. For an allocation shrink event, LSF fails the allocation release request.
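A notification command could follow the exit protocol described above along these lines. This is a minimal sketch, not a command that is supplied with LSF. It assumes that LSB_RESIZE_NOTIFY_OK and LSB_RESIZE_NOTIFY_FAIL hold integer strings, and the notify_application function is a hypothetical placeholder for whatever mechanism the application developer provides.

```python
import os
import sys


def exit_value(succeeded: bool, env=None) -> int:
    """Return the exit value that reports success or failure to LSF.

    Assumption: LSF has set LSB_RESIZE_NOTIFY_OK and
    LSB_RESIZE_NOTIFY_FAIL to integer strings in the environment.
    """
    env = os.environ if env is None else env
    name = "LSB_RESIZE_NOTIFY_OK" if succeeded else "LSB_RESIZE_NOTIFY_FAIL"
    return int(env[name])


def notify_application() -> bool:
    """Hypothetical placeholder: inform the application of the change.

    A real command would contact the running application and return
    True only if the application accepted the new allocation.
    """
    return True


def main() -> int:
    # Standard output and error go to the NULL device, so the exit
    # value is the only result that LSF sees from this command.
    try:
        ok = notify_application()
    except Exception:
        ok = False
    return exit_value(ok)


# Only act when the notification environment has been set up.
if __name__ == "__main__" and "LSB_RESIZE_NOTIFY_OK" in os.environ:
    sys.exit(main())
```

Reading the exit values from the environment, rather than hard-coding numbers, keeps the command independent of the specific values that LSF assigns.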

For a list of other environment variables that apply to the resize notification command, see the environment variables reference.