In my last blog, we walked through an example showing how IBM Spectrum LSF automatically detects the presence of NVIDIA GPUs on hosts in the cluster and performs the necessary scheduler configuration.
In this blog, we take a closer look at the integration between Spectrum LSF and NVIDIA DCGM, which provides GPU usage information for jobs submitted to the system. To enable the integration, we need to specify the LSF_DCGM_PORT=<port number> parameter in $LSF_ENVDIR/lsf.conf.
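As an illustration, the lsf.conf entry might look like the following sketch. The port number 5555 is an assumption here; it is DCGM's default, but your site may run the DCGM host engine on a different port.

```shell
# $LSF_ENVDIR/lsf.conf (excerpt -- illustrative sketch)
# Port on which the DCGM host engine listens; 5555 is the DCGM default.
# Adjust this to match the port your nv-hostengine actually uses.
LSF_DCGM_PORT=5555
```

After editing lsf.conf, the LSF daemons typically need to pick up the change (for example, via lsadmin reconfig and badmin mbdrestart).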
root@kilenc:/etc/profile.d# cd $LSF_ENVDIR
root@kilenc:/opt/ibm/lsfsuite/lsf/conf# cat lsf.conf |grep -i DCGM
You can find more details about the LSF_DCGM_PORT parameter and what it enables here.
Before continuing, please ensure that the DCGM daemon is up and running. Below, we start DCGM on the default port and run a query command to confirm that it is responding.
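The daemon itself is started with the DCGM host engine. A minimal sketch, assuming the default port (5555):

```shell
# Start the DCGM host engine on its default port (5555).
# Pass a port explicitly (e.g. "nv-hostengine -p <port>") if your
# LSF_DCGM_PORT setting uses a non-default value.
nv-hostengine
```

On hosts where DCGM is installed as a package, the bundled systemd service (e.g. systemctl start nvidia-dcgm, depending on the package) can be used instead of launching nv-hostengine by hand.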
root@kilenc:/opt/ibm/lsfsuite/lsf/conf# dcgmi discovery -l
Next, let's submit a GPU job to IBM Spectrum LSF to demonstrate the collection of GPU accounting. Note that the job must be submitted with exclusive mode specified in order for GPU resource usage to be collected. As in my previous blog, we submit the gpu-burn test job (formally known as the Multi-GPU CUDA stress test).
test@kilenc:~/gpu-burn$ bsub -gpu "num=1:mode=exclusive_process" ./gpu_burn 120
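For contrast, the sketch below shows the same job submitted in the default shared mode; as noted above, per-job GPU usage is only collected when exclusive mode is specified, so a submission like this would not produce the DCGM accounting data.

```shell
# Sketch: shared-mode submission -- the job runs, but per-job GPU
# accounting via DCGM is not collected in this mode.
bsub -gpu "num=1:mode=shared" ./gpu_burn 120
```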
Job 54086 runs to successful completion, and we use the Spectrum LSF bjobs command with the -l and -gpu options to display the GPU usage information, which I've highlighted in bold below.
test@kilenc:~/gpu-burn$ bjobs -l -gpu 54086
Job <54086>, User <test>, Project <default>, Status <DONE>, Queue <normal>, Com
RESOURCE REQUIREMENT DETAILS:
GPU REQUIREMENT DETAILS: