GPU
The following new features affect GPU support.
Integration with NVIDIA Data Center GPU Manager (DCGM)
The NVIDIA Data Center GPU Manager (DCGM) is a suite of data center management tools for managing and monitoring GPU resources in an accelerated data center. LSF integrates with NVIDIA DCGM to work more effectively with the GPUs in the LSF cluster. For jobs that request GPU resources, DCGM provides additional functionality by:
- providing GPU usage information for EXCLUSIVE_PROCESS mode jobs.
- checking the GPU status before and after the jobs run to identify and filter out unhealthy GPUs.
- synchronizing the GPU auto-boost feature to support jobs that run across multiple GPUs.
Enable the DCGM integration by defining the LSF_DCGM_PORT parameter in the lsf.conf file.
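As a rough illustration of the step above, the following sketch checks whether the text of an lsf.conf file defines LSF_DCGM_PORT. The helper name dcgm_port and the port value 5555 are assumptions made for this example only; use the port on which the DCGM host engine listens in your environment. The sketch only handles simple NAME=value lines and is not how LSF itself reads the file.

```python
import re

def dcgm_port(conf_text):
    """Return the LSF_DCGM_PORT value as an int, or None if it is not defined.

    Hypothetical helper for illustration; handles only simple NAME=value lines.
    """
    port = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank padding
        m = re.match(r'^LSF_DCGM_PORT\s*=\s*"?(\d+)"?$', line)
        if m:
            port = int(m.group(1))  # in this sketch, the last definition wins
    return port

# Illustrative lsf.conf excerpt; 5555 is an assumed value, not a requirement.
sample = """
# Enable the NVIDIA DCGM integration
LSF_DCGM_PORT=5555
"""

port = dcgm_port(sample)
if port is None:
    print("DCGM integration is not enabled: LSF_DCGM_PORT is not defined")
else:
    print(f"DCGM integration is enabled on port {port}")
```

If the parameter is missing or commented out, the helper returns None, which corresponds to the integration not being enabled.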