Nvidia Data Center GPU Manager (DCGM) features
The Nvidia Data Center GPU Manager (DCGM) is a suite of data center management tools that allows you to manage and monitor GPU resources in an accelerated data center.
LSF integrates with Nvidia DCGM to work more effectively with GPUs in the LSF cluster. DCGM provides additional functionality when working with jobs that request GPU resources by:
- providing GPU usage information for EXCLUSIVE_PROCESS mode jobs.
- checking GPU health so that unhealthy GPUs are automatically filtered out when the job allocates GPU resources, which ensures that jobs run on healthy GPUs. DCGM provides health-check mechanisms, and LSF uses them to verify GPU status before, during, and after the job runs. If the DCGM status on the execution host is not valid, the bjobs -l command shows an error message; the job still runs, but GPU usage reports are not available from that host.
- automatically adding back any previously-unhealthy GPUs that are healthy again so that these GPUs are available for job allocation.
- synchronizing the GPU auto-boost feature to support jobs that run across multiple GPUs, including multiple GPUs on a single host.
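The health checks that LSF relies on can also be exercised manually with DCGM's dcgmi tool. A quick sketch, where the group ID 0 (the default group of all GPUs) and the watch setting are illustrative:

```shell
# Enable all background health watches on GPU group 0
dcgmi health -g 0 -s a

# Run a health check against the watched systems on that group
dcgmi health -g 0 -c
```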
Enable the DCGM integration by defining the LSF_DCGM_PORT parameter in the lsf.conf file. After you enable the parameter, you must also start DCGM to use these features. If LSF cannot load the DCGM library, you might need to create a symbolic link so that libdcgm.so resolves to the installed library:
sudo ln -s /usr/lib64/libdcgm.so.1 /usr/lib64/libdcgm.so
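As a minimal sketch, the setup might look like the following. The port number 5555 is illustrative (it is the default port of DCGM's standalone host engine); match it to whatever port your host engine actually uses:

```shell
# lsf.conf: point LSF at the DCGM host engine port (illustrative value)
LSF_DCGM_PORT=5555

# Start the DCGM standalone host engine (listens on port 5555 by default)
sudo nv-hostengine

# Reconfigure LSF so that the new parameter takes effect
lsadmin reconfig
badmin mbdrestart
```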
Use the -gpu option with the bjobs, bhist, and bacct commands to display GPU usage information collected by DCGM after the job finishes. The -gpu option must be combined with the following command options:
- For the bjobs command, run the -gpu option together with -l or -UF.
- For the bhist command, run the -gpu option together with -l.
- For the bacct command, run the -gpu option together with -l.
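For example, to inspect the DCGM-collected GPU usage of a finished job (the job ID 1234 is illustrative):

```shell
# Long-format job output, extended with GPU usage from DCGM
bjobs -l -gpu 1234

# The same information from job history and from accounting records
bhist -l -gpu 1234
bacct -l -gpu 1234
```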