GPU enhancements
The following enhancements affect LSF GPU support.
Improved performance for GPU resource collection
LSF improves the performance of GPU metric collection by changing how the management host LIM and server host LIMs collect GPU resource information. These improvements include no longer requiring the mbatchd daemon to get GPU resource information from the management host LIM, and completely removing the built-in GPU resources (gpu_<num>n) from the management host LIM.
To stop mbatchd from getting GPU resource information from the management host LIM, set the LSF_GPU_RESOURCE_IGNORE parameter to Y in the lsf.conf file. This improves LSF response time because there are fewer LSF resources to manage and display.
In addition, if LSF_GPU_AUTOCONFIG is set to Y and LSB_GPU_NEW_SYNTAX is set to Y or extend, all built-in GPU resources (gpu_<num>n) are completely removed from the management host LIM. LSF uses a different method for the management host LIM and server host LIMs to collect GPU information. This further improves performance by having fewer built-in LSF resources.
LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend
LSF_GPU_RESOURCE_IGNORE=Y
Displaying power utilization for each GPU
You can now enable the lsload -gpuload command to display the power utilization per GPU on the host.
The lsload -gpuload command displays the power utilization for each GPU in the new gpu_power column.
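For example, the output might look like the following (illustrative only; the host name and values are hypothetical, and some columns are omitted for brevity):

lsload -gpuload
HOST_NAME       gpuid  gpu_model      gpu_ut  gpu_mused  gpu_power  gpu_status
hosta           0      TeslaV100_SXM2 50%     1.5G       60.2 W     ok
                1      TeslaV100_SXM2 0%      0M         25.1 W     ok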
Changes to Nvidia GPU resource requirements
LSF now has changes to the selection of Nvidia GPU resources in the GPU resource requirement string (bsub -gpu option or the GPU_REQ parameter in the lsb.queues and lsb.applications files).
The new gvendor keyword in the GPU requirements strings enables LSF to allocate GPUs with the specified vendor type. Specify gvendor=nvidia to request Nvidia GPUs, which is the default value.
The nvlink=yes keyword in the GPU requirements string is deprecated. Replace nvlink=yes in the GPU requirements string with glink=yes instead.
The new glink keyword in the GPU requirements strings enables job enforcement for special connections among GPUs. If you specify glink=yes when using Nvidia GPUs, LSF must allocate GPUs that have the NVLink connection. By default (without glink=yes), LSF can allocate GPUs that do not have special connections if there is an insufficient number of GPUs with these connections. Do not use glink together with the deprecated nvlink keyword.
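For example, the following job submission requests two Nvidia GPUs and requires that they are connected by NVLink (the application name is hypothetical):

bsub -gpu "num=2:gvendor=nvidia:glink=yes" ./gpu_app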
In addition, the lshosts -gpu and bhosts -gpu command options now show the GPU vendor type (AMD or Nvidia).
Support for AMD GPUs
LSF now supports the use of AMD GPUs.
You can now specify AMD GPU models in the GPU requirements strings (bsub -gpu option or the GPU_REQ parameter in the lsb.queues and lsb.applications files).
The new gvendor keyword in the GPU requirements strings enables LSF to allocate GPUs with the specified vendor type. Specify gvendor=amd to request AMD GPUs. If this is not specified, the default is to request Nvidia GPUs.
The new glink keyword in the GPU requirements strings enables job enforcement for special connections among GPUs. If you specify glink=yes when using AMD GPUs, LSF must allocate GPUs that have the xGMI connection. By default (without glink=yes), LSF can allocate GPUs that do not have special connections if there is an insufficient number of GPUs with these connections. Do not use glink together with the deprecated nvlink keyword.
In addition, the lshosts -gpu and bhosts -gpu command options now show the GPU vendor type (AMD or Nvidia).
To enable GPU requirements strings and use AMD GPUs in these requirements strings, LSF_GPU_AUTOCONFIG=Y, LSB_GPU_NEW_SYNTAX=extend, and LSF_GPU_RESOURCE_IGNORE=Y must be defined in the lsf.conf file.
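For example, with the following lsf.conf settings in place, the job submission below requests one AMD GPU (the application name is hypothetical):

LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend
LSF_GPU_RESOURCE_IGNORE=Y

bsub -gpu "num=1:gvendor=amd" ./gpu_app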
Improved GPU preemption
LSF Version 10.1 Fix Pack 7 introduced preemptive scheduling for GPU jobs so that a lower priority GPU job can release GPU resources for higher priority GPU jobs, with certain restrictions on the types of jobs that are involved in preemptive scheduling. LSF Version 10.1 Fix Pack 11 now introduces improvements to GPU preemption that remove or relax several of these previous restrictions on GPU jobs, including the following:
- Non-GPU jobs can now preempt lower priority GPU jobs.
- GPU jobs no longer have to be configured for automatic job migration and rerun to be preemptable by higher priority jobs. That is, the MIG parameter no longer has to be defined and the RERUNNABLE parameter no longer has to be set to yes in the lsb.queues or lsb.applications file. However, properly configure the MIG, RERUNNABLE, or REQUEUE parameters so that GPU resources are released after the job is preempted.
- GPU jobs no longer have to have either mode=exclusive_process or j_exclusive=yes set to be preempted by other GPU jobs. GPU jobs can also use mode=shared if the GPU is used by only one shared-mode job. Higher priority GPU jobs cannot preempt shared-mode GPU jobs if there are multiple jobs running on the GPU.
Previously, to enable GPU preemption, you defined the LSB_GPU_NEW_SYNTAX parameter in the lsf.conf file as either Y or extend, then configured the PREEMPTABLE_RESOURCES parameter in the lsb.params file to include the ngpus_physical resource. LSF then treated the GPU resources the same as other preemptable resources.
To enable the improved GPU preemption features introduced in LSF Version 10.1 Fix Pack 11, you must define the LSB_GPU_NEW_SYNTAX parameter in the lsf.conf file as extend (not as Y), then configure the PREEMPTABLE_RESOURCES parameter in the lsb.params file to include the ngpus_physical resource. If you define the LSB_GPU_NEW_SYNTAX parameter in the lsf.conf file as Y instead of extend, GPU job preemption is enabled without these improvements and still has the restrictions from LSF Version 10.1 Fix Pack 7.
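For example, the following settings together enable the improved GPU preemption. In the lsf.conf file:

LSB_GPU_NEW_SYNTAX=extend

In the lsb.params file:

PREEMPTABLE_RESOURCES=ngpus_physical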