How re-sizable jobs work with other LSF features
Re-sizable jobs behave differently when used together with other LSF features.
- Resource usage
- When a job grows or shrinks, its resource reservation (for example, memory or shared resources) changes proportionately.
- Job-based resource usage does not change in grow or shrink operations.
- Host-based resource usage changes only when the job gains tasks on a new host or releases all tasks on a host.
- Task-based resource usage changes whenever the job grows or shrinks.
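The three usage rules above can be illustrated with a small Python model. This is only a sketch of the bookkeeping described here, not LSF code: the function name `reserved_amount`, the `scope` labels, the host names, and the per-unit amount are all hypothetical.

```python
def reserved_amount(scope, per_unit, allocation):
    """Return the total reservation for a hypothetical resource.

    scope      -- "job", "host", or "task" (illustrative labels only)
    per_unit   -- amount reserved per job, per host, or per task
    allocation -- list of host names, one entry per task
    """
    if scope == "job":
        # Job-based usage is independent of the allocation size.
        return per_unit
    if scope == "host":
        # Host-based usage depends on the number of distinct hosts.
        return per_unit * len(set(allocation))
    if scope == "task":
        # Task-based usage depends on the number of tasks.
        return per_unit * len(allocation)
    raise ValueError(f"unknown scope: {scope}")


# Hypothetical allocation: two tasks on hostA, one task on hostB.
before = ["hostA", "hostA", "hostB"]
grow_on_existing_host = before + ["hostB"]  # one more task, same hosts
grow_on_new_host = before + ["hostC"]       # one more task, one new host
```

In this model, growing on an existing host changes only the task-based total, while growing onto a new host changes both the host-based and task-based totals. The job-based total stays the same in both cases.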
- Limits
- When a resize occurs, tasks are added to a job's allocation only if the job does not violate any resource limits placed on it.
- Job scheduling and dispatch
- The JOB_ACCEPT_INTERVAL parameter in lsb.params or lsb.queues controls the number of seconds to wait after dispatching a job to a host before dispatching a second job to the same host. The parameter applies to all allocated hosts of a parallel job. For re-sizable job allocation requests, JOB_ACCEPT_INTERVAL applies to newly allocated hosts.
- Chunk jobs
- Because candidate jobs for the chunk job feature are short-running sequential jobs, the re-sizable job feature does not support job chunking:
- Automatically re-sizable jobs in a chunk queue or application profile cannot be chunked together.
- bresize commands to resize job allocations do not apply to running chunk job members.
- Energy aware scheduling
- For a re-sizable job, bjobs can only get the energy cost of the job's latest execution hosts.
- Requeued jobs
- Jobs requeued with brequeue start from the beginning. After re-queuing, LSF restores the original allocation request for the job.
- Launched jobs
- Parallel tasks running through blaunch can be re-sizable. Automatic job resizing is a signaling mechanism only. It does not expand the extent of the original job launched with blaunch. The resize notification script is required along with a signal listening script. The signal listening script runs additional blaunch commands on notification to allocate the re-sized resources to make them available to the job tasks. For help creating signal listening and notification scripts, contact IBM Support.
- Switched jobs
- bswitch can switch re-sizable jobs between queues regardless of job state (including the job's resizing state). Once the job is switched, the parameters in the new queue apply, including threshold configuration, run limit, CPU limit, and queue-level resource requirements.
- User group administrators
- User group administrators are allowed to issue bresize commands to release part of the resources from a job allocation (bresize release), request additional tasks to allocate to a job (bresize request), or cancel an active pending resize request (bresize cancel).
- Re-queue exit values
- If job-level, application-level, or queue-level REQUEUE_EXIT_VALUES are defined and the job exits with one of the defined exit codes, LSF puts the re-queued job back in PEND status. For re-sizable jobs, LSF schedules the job according to the initial allocation request regardless of any job allocation size change.
- Automatic job rerun
- A re-runnable job is rescheduled after the first running host becomes unreachable. Once the job is rerun, LSF schedules re-sizable jobs based on their initial allocation request.
- Compute units
- Automatically re-sizable jobs can have compute unit requirements.
- Alternative resource requirements
- Re-sizable jobs can have alternative resource requirements. When using bresize request to request additional tasks, the task increase is based on the term used for the initial task allocation.
- Compound resource requirements
- Re-sizable jobs can have compound resource requirements. Only the portion of the job represented by the last term of the compound resource requirement is eligible for automatic resizing. When using bresize release to release tasks or bresize request to request additional tasks, you can only release or request tasks represented by the last term of the compound resource requirement. To release or request tasks in earlier terms, run bresize release or bresize request repeatedly to release or request tasks in subsequent last terms.
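The last-term rule for releasing tasks can be sketched as a small Python model. This is one reading of the rule described above, not LSF behavior taken from code: the function name `release_tasks` and the representation of a compound requirement as a list of per-term task counts are hypothetical, and the model assumes that a term reduced to zero tasks drops out so that the preceding term becomes the new last term.

```python
def release_tasks(terms, count):
    """Release `count` tasks from the last term of a compound requirement.

    terms -- list of task counts, one per term, for example [2, 4] for a
             two-term requirement holding 2 and 4 tasks (illustrative).
    Returns a new list. A term that drops to zero tasks is removed, so the
    preceding term becomes the last term for the next call (an assumption).
    """
    if not terms:
        raise ValueError("no terms left to release from")
    if count < 0 or count > terms[-1]:
        raise ValueError("only tasks held by the last term can be released")
    remaining = terms[:-1] + [terms[-1] - count]
    if remaining[-1] == 0:
        remaining.pop()
    return remaining


# Releasing a task from the first term takes two calls in this model:
# the first call empties the last term, the second reaches the earlier term.
step1 = release_tasks([2, 4], 4)
step2 = release_tasks(step1, 1)
```

A single call that asks for more tasks than the last term holds is rejected in this model, which mirrors the need to run the release repeatedly to reach earlier terms.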
- GPU resource requirements
- If the LSB_GPU_NEW_SYNTAX parameter in the lsf.conf configuration file is set to extend, then when jobs with GPU resource requirements grow or shrink tasks, GPU allocations grow or shrink accordingly based on those resource requirements.
Note: When job slots and GPUs shrink, the whole host shrinks. When releasing all hosts, all execution hosts except the first execution host are released. Releasing partial slots for the execution host is not supported.
This table outlines how GPU resource requirements impact re-sizable jobs:
| GPU resource requirement | Support | Behavior for resize requirement |
| --- | --- | --- |
| num=num_gpus[/task \| host] | Yes | When a job grows or shrinks slots, its GPU usage changes proportionately. |
| mode=shared \| exclusive_process | Yes | When a job grows or shrinks slots, its GPU usage changes proportionately. |
| mps | No | Automatically resizing jobs will have no pending requests, and manually running bresize will be rejected. |
| j_exclusive=yes \| no | Yes | When a job grows or shrinks slots, its GPU usage changes proportionately. |
| aff=yes \| no | No | Automatically resizing jobs will have no pending requests, and manually running bresize will be rejected. |
| block=yes \| no | Yes | Block distribution will be applied to newly allocated slots. |
| gpack=yes \| no | Yes | If a new host will be allocated to a re-sizable job, when selecting GPU and hosts, it will consider gpack policy for pack scheduling. |
| gvendor=amd \| nvidia | Yes | LSF allocates GPUs with the specified vendor type. |
| gmodel=model_name[-mem_size] | Yes | LSF allocates the GPUs with the same model, if available. |
| gmem=mem_value | Yes | LSF allocates GPU memory on each newly allocated GPU required by the new tasks. |
| gtile=! \| tile_num | Yes | gtile requirements only apply to newly allocated GPUs. |
| glink=yes | Yes | Enables job enforcement for special connections among new allocated GPUs for new tasks. |
| mig | No | Automatically resizing jobs will have no pending requests, and manually running bresize will be rejected. While LSF re-sizes jobs, hosts with GPU devices that enable mig will not be considered. |
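The Support column of the table above can be captured as a lookup for checking a GPU requirement string before submitting a re-sizable job. This is a minimal sketch, not an LSF tool: the names `RESIZE_SUPPORT` and `blocking_keywords` are hypothetical, the input is assumed to be a colon-separated list of keyword=value pairs, and the model treats any explicit occurrence of a keyword as its table row applying. How LSF treats particular values, such as mps=no, is not stated in the table.

```python
# Support column of the table, keyed by GPU requirement keyword.
RESIZE_SUPPORT = {
    "num": True, "mode": True, "mps": False, "j_exclusive": True,
    "aff": False, "block": True, "gpack": True, "gvendor": True,
    "gmodel": True, "gmem": True, "gtile": True, "glink": True,
    "mig": False,
}


def blocking_keywords(gpu_req):
    """Return the keywords in `gpu_req` whose table row says "No".

    gpu_req -- colon-separated keyword=value pairs (assumed format),
               for example "num=2:mode=shared:gmem=4G".
    Raises ValueError for a keyword that the table does not list.
    """
    keywords = [
        part.split("=", 1)[0].strip()
        for part in gpu_req.split(":")
        if part.strip()
    ]
    unknown = [k for k in keywords if k not in RESIZE_SUPPORT]
    if unknown:
        raise ValueError(f"keywords not in the table: {unknown}")
    return [k for k in keywords if not RESIZE_SUPPORT[k]]
```

An empty result means that every keyword in the string has a "Yes" row in the table. A non-empty result lists the keywords for which, according to the table, automatic resizing has no pending requests and a manual bresize is rejected.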