LSF_DCGM_PORT

Enables the NVIDIA Data Center GPU Manager (DCGM) features and specifies the port number that LSF uses to communicate with the DCGM daemon.

Syntax

LSF_DCGM_PORT=port_number

Description

Define this parameter to enable DCGM features with LSF. LSF uses this port to communicate with the DCGM daemon. Set the port number to the port that /usr/bin/nv-hostengine reports. For example, if the DCGM daemon reports port 5555:
$ /usr/bin/nv-hostengine
Started host engine version 2.3.5 using port number: 5555
then set LSF_DCGM_PORT=5555.
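
The lookup above can be scripted. The following is a minimal sketch, assuming the startup message has the same form as the sample shown above; extract_dcgm_port is a hypothetical helper name, not part of LSF or DCGM.

```shell
# Hypothetical helper: pull the port number out of the nv-hostengine startup line.
# Assumes the message contains "port number: <digits>" as in the sample above.
extract_dcgm_port() {
  # $1: the startup message printed by nv-hostengine
  printf '%s\n' "$1" | sed -n 's/.*port number: \([0-9][0-9]*\).*/\1/p'
}

startup_line='Started host engine version 2.3.5 using port number: 5555'
port=$(extract_dcgm_port "$startup_line")

# Print the parameter line to add to the LSF configuration (only if a port was found).
[ -n "$port" ] && echo "LSF_DCGM_PORT=$port"
```

If the startup message has a different form, the helper prints nothing and no parameter line is produced.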

DCGM provides additional functionality when working with jobs that request GPU resources by:

  • providing GPU usage information for EXCLUSIVE_PROCESS mode jobs.
  • checking GPU status to automatically filter out unhealthy GPUs when a job allocates GPU resources, and to automatically add a GPU back if it becomes healthy again. DCGM provides mechanisms to check GPU health, and LSF integrates these mechanisms to check GPU status before, during, and after the job runs so that the job's GPU requirements are met. If LSF detects an unhealthy GPU before the job completes, LSF requeues the job so that the job runs on healthy GPUs. If the DCGM status of the execution host is not valid, the bjobs -l command shows an error message. The job still runs, but GPU resource usage reports are not available from that host.
  • synchronizing the GPU auto-boost feature to support jobs that run across multiple GPUs, including jobs that run across multiple GPUs on a single host.

Use the -gpu option with the bjobs, bhist, and bacct commands to display GPU usage information from DCGM after the job finishes. The -gpu option must be combined with the following command options:

  • For the bjobs command, use the -gpu option with the -l or -UF option.
  • For the bhist command, use the -gpu option with the -l option.
  • For the bacct command, use the -gpu option with the -l option.
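
As an illustration of these option combinations, the sketch below prints the command line for each of the three commands. gpu_usage_cmd is a hypothetical helper, and 1234 is a placeholder job ID. The helper only prints the command, so it can be tried on a host without LSF installed.

```shell
# Hypothetical helper: print the command line that shows GPU usage for a job.
# Uses -l with -gpu, which is valid for all three commands (bjobs also accepts -UF).
gpu_usage_cmd() {
  tool=$1
  job_id=$2
  case "$tool" in
    bjobs|bhist|bacct) echo "$tool -l -gpu $job_id" ;;
    *) echo "unsupported command: $tool" >&2; return 1 ;;
  esac
}

# 1234 is a placeholder job ID.
gpu_usage_cmd bjobs 1234
gpu_usage_cmd bhist 1234
gpu_usage_cmd bacct 1234
```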

After changing this parameter, restart the sbatchd and RES daemons to apply the change.
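
A possible way to script the restart is sketched below. It assumes the badmin hrestart and lsadmin resrestart subcommands are available in your LSF release (newer releases may provide a different control command) and that restarting on all hosts is intended. restart_lsf_daemons is a hypothetical helper that only prints the commands unless APPLY=1 is set.

```shell
# Sketch: restart sbatchd and RES after editing LSF_DCGM_PORT.
# Assumption: badmin hrestart restarts sbatchd and lsadmin resrestart restarts RES
# in this LSF release. Dry-run by default; set APPLY=1 to run the commands.
restart_lsf_daemons() {
  for cmd in "badmin hrestart all" "lsadmin resrestart all"; do
    if [ "${APPLY:-0}" = "1" ]; then
      $cmd
    else
      echo "would run: $cmd"
    fi
  done
}

restart_lsf_daemons
```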

After you enable this parameter, you must start DCGM to use these features.

Note: If the DCGM integration does not work as expected because the libdcgm.so file is missing, create a symbolic link so that the libdcgm.so file exists and is accessible:
sudo ln -s /usr/lib64/libdcgm.so.1 /usr/lib64/libdcgm.so
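
To avoid an error when the link already exists, the command can be wrapped in a check. This is a minimal sketch: ensure_libdcgm_link is a hypothetical helper, and the default directory /usr/lib64 is taken from the command above and may differ on your system.

```shell
# Hypothetical helper: create libdcgm.so as a symbolic link to libdcgm.so.1 in the
# given directory, but only when the versioned file exists and the link is missing.
ensure_libdcgm_link() {
  lib_dir=${1:-/usr/lib64}
  if [ -e "$lib_dir/libdcgm.so" ]; then
    echo "libdcgm.so already present in $lib_dir"
  elif [ -e "$lib_dir/libdcgm.so.1" ]; then
    ln -s "$lib_dir/libdcgm.so.1" "$lib_dir/libdcgm.so"
  else
    echo "libdcgm.so.1 not found in $lib_dir" >&2
    return 1
  fi
}

# Call as root against the real library directory, for example:
#   ensure_libdcgm_link /usr/lib64
```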

Default

Not defined. DCGM features are disabled.

See also

MBD_REFRESH_TIME and NEWJOB_REFRESH in lsb.params