Configuring GPU run time

Procedure

  1. Set a value for the GPU_RUN_TIME_FACTOR parameter for the queue in lsb.queues or for the cluster in lsb.params.
  2. To enable historical GPU run time of finished jobs, specify ENABLE_GPU_HIST_RUN_TIME=Y for the queue in lsb.queues or for the cluster in lsb.params.

    Enabling historical GPU time ensures that the user's priority does not increase significantly after a GPU job finishes.
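
As a rough illustration of the two steps above, a queue-level configuration in lsb.queues might look like the following fragment. The queue name, the FAIRSHARE share assignment, and the factor value of 1.0 are placeholder assumptions chosen for the example, not recommended settings.

```
# lsb.queues (illustrative fragment; queue name and values are hypothetical)
Begin Queue
QUEUE_NAME               = gpu_fairshare
FAIRSHARE                = USER_SHARES[[default, 1]]
GPU_RUN_TIME_FACTOR      = 1.0
ENABLE_GPU_HIST_RUN_TIME = Y
End Queue
```

The same two parameters can instead be set in the Parameters section of lsb.params to apply them to the whole cluster.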

Results

If you set the GPU run time factor and enabled the use of GPU historical run time, the dynamic priority is calculated according to the following formula:

dynamic priority = number_shares / (cpu_time * CPU_TIME_FACTOR + (historical_run_time + run_time) * RUN_TIME_FACTOR + (committed_run_time - run_time) * COMMITTED_RUN_TIME_FACTOR + (1 + job_slots) * RUN_JOB_FACTOR + fairshare_adjustment(struct* shareAdjustPair) * FAIRSHARE_ADJUSTMENT_FACTOR + ((historical_gpu_run_time + gpu_run_time) * ngpus_physical) * GPU_RUN_TIME_FACTOR)

For historical_gpu_run_time, if ENABLE_GPU_HIST_RUN_TIME is defined in the lsb.params file, the value is the run time (measured in hours) of finished GPU jobs, with a decay applied over time based on HIST_HOURS in the lsb.params file (5 hours by default).
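
The formula can be sketched in Python to see how the GPU term affects a user's priority. This is a minimal sketch, not LSF's implementation: the function name and keyword arguments are hypothetical, the fairshare adjustment is passed in as a plain number, and the GPU term is assumed to be part of the denominator so that more GPU usage lowers the priority. Decay of the historical values is not modeled.

```python
def dynamic_priority(
    number_shares,
    cpu_time=0.0, cpu_time_factor=0.0,
    historical_run_time=0.0, run_time=0.0, run_time_factor=0.0,
    committed_run_time=0.0, committed_run_time_factor=0.0,
    job_slots=0, run_job_factor=0.0,
    fairshare_adjustment=0.0, fairshare_adjustment_factor=0.0,
    historical_gpu_run_time=0.0, gpu_run_time=0.0,
    ngpus_physical=0, gpu_run_time_factor=0.0,
):
    """Illustrative dynamic priority calculation (hypothetical helper)."""
    # GPU charge: accumulated GPU run time scaled by the number of GPUs.
    gpu_term = (
        (historical_gpu_run_time + gpu_run_time)
        * ngpus_physical
        * gpu_run_time_factor
    )
    # Assumption: the GPU term is added to the other usage terms
    # in the denominator.
    usage = (
        cpu_time * cpu_time_factor
        + (historical_run_time + run_time) * run_time_factor
        + (committed_run_time - run_time) * committed_run_time_factor
        + (1 + job_slots) * run_job_factor
        + fairshare_adjustment * fairshare_adjustment_factor
        + gpu_term
    )
    if usage <= 0:
        raise ValueError("total usage must be positive to compute a priority")
    return number_shares / usage


# Same user with and without a running 4-GPU job (illustrative numbers).
without_gpu = dynamic_priority(100, job_slots=1, run_job_factor=1.0)
with_gpu = dynamic_priority(
    100, job_slots=1, run_job_factor=1.0,
    gpu_run_time=2.0, ngpus_physical=4, gpu_run_time_factor=0.5,
)
```

Under these assumptions, the user with the running GPU job gets a lower priority than the same user without one. Moving the same hours from gpu_run_time to historical_gpu_run_time keeps the GPU term unchanged, which is the effect that enabling historical GPU run time is meant to provide after the job finishes.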

Note that:
  • For jobs that ask for exclusive use of a GPU, gpu_run_time is the same as the job's run time and ngpus_physical is the value of the requested ngpus_physical in the job's effective RES_REQ string.
  • For jobs that ask for an exclusive host (with the bsub -x option), the gpu_run_time is the same as the job's run time and ngpus_physical is the number of GPUs on the execution host.
  • For jobs that ask for an exclusive compute unit (bsub -R "cu[excl]" option), the gpu_run_time is the same as the job's run time and ngpus_physical is the number of GPUs on all the execution hosts in the compute unit.
  • Jobs that ask for shared mode GPUs do not contribute to dynamic user priority calculations and are not charged for fair sharing.
The gpu_run_time value is the run time requested at GPU job submission with the -gpu option of bsub, the queue or application profile configuration with the GPU_REQ parameter, or the cluster configuration with the LSB_GPU_REQ parameter.
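
The rules in the list above for choosing ngpus_physical can be summarized in a small Python sketch. The function, its mode names, and its parameters are hypothetical and only mirror the four cases described. Returning 0 for shared mode is a modeling assumption that makes the GPU term vanish for jobs that are not charged.

```python
def charged_ngpus(mode, requested_ngpus=0, execution_host_gpus=0, cu_host_gpus=()):
    """Return the ngpus_physical value used in the GPU term (illustrative).

    mode is one of the hypothetical labels:
      "exclusive_gpu"  - job asks for exclusive use of a GPU
      "exclusive_host" - job asks for an exclusive host (bsub -x)
      "exclusive_cu"   - job asks for an exclusive compute unit (cu[excl])
      "shared"         - job asks for shared mode GPUs
    """
    if mode == "exclusive_gpu":
        # Value requested in the job's effective RES_REQ string.
        return requested_ngpus
    if mode == "exclusive_host":
        # All GPUs on the execution host are counted.
        return execution_host_gpus
    if mode == "exclusive_cu":
        # GPUs on every execution host in the compute unit are counted.
        return sum(cu_host_gpus)
    if mode == "shared":
        # Assumption: shared mode jobs are modeled as zero GPUs charged.
        return 0
    raise ValueError(f"unknown GPU mode: {mode!r}")


# A job that requested 2 GPUs but holds an entire 8-GPU host exclusively.
gpus_for_exclusive_host = charged_ngpus(
    "exclusive_host", requested_ngpus=2, execution_host_gpus=8
)
```

In this sketch, a job that holds a whole host or compute unit is charged for every GPU it keeps other jobs from using, not only for the GPUs it requested.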