Known issues

LSF 10.1 has the following known issues.

  • The NVIDIA Data Center GPU Manager (DCGM) integration is enabled by defining the LSF_DCGM_PORT parameter in the lsf.conf file. However, the NVIDIA Multi-Instance GPU (MIG) integration with DCGM does not work with LSF; this issue will be addressed in a future LSF fix.
  • When you set global limits for job resource allocation, the limit is applied only to the resource on the local cluster; it does not take effect on the resource globally. For instance, if you check global resource usage by running the bgpinfo resource command, the output does not show that the resource was released after the job ran (that is, the number of reserved resources does not decrease). However, if you check local resource usage by running the bhosts -s command on the local cluster, the output shows the released resources for that cluster. The local cluster, however, is unaware of any global limits set on the global resources.
  • If the LSF_STRICT_CHECKING parameter is not defined in the $LSF_ENVDIR/lsf.conf file, the following known issues affect existing running or suspended blaunch jobs after the cluster is updated to LSF 10.1 Fix Pack 12:
    • There are XDR error messages in the sbatchd and RES log files.
    • LSF cannot update the runtime resource usage information of existing running blaunch jobs.
    • LSF cannot stop existing running blaunch jobs.
    • LSF cannot resume existing suspended blaunch jobs.

    To avoid these problems, ensure that there are no running or suspended blaunch jobs before you update your cluster to LSF 10.1 Fix Pack 12.

  • External authentication (eauth) fails when it cannot resolve the true operating system uid or gid of the calling process. This situation might occur within nested namespaces.
  • When specifying time zones in automatic time-based configurations, if you specify the same abbreviation for multiple time zones, LSF might not select the correct time zone. A patch will be made available shortly after the release of Fix Pack 9 to address this issue.
  • The DCGM (NVIDIA Data Center GPU Manager) integration does not work as expected due to a missing libdcgm.so file. To resolve this issue, create a symbolic link to ensure that the libdcgm.so file exists and is accessible:
    sudo ln -s /usr/lib64/libdcgm.so.1 /usr/lib64/libdcgm.so
  • On RHEL 8, the LSF cluster cannot start up due to a missing libnsl.so.1 file. To resolve this issue, install the libnsl package to ensure that the libnsl.so.1 file exists:
    yum install libnsl
  • On AIX, a TCL parser issue causes jobs to pend when the LSF_STRICT_RESREQ=N parameter is set in the lsf.conf file, even though AIX hosts are available. To avoid the problem, set LSF_STRICT_RESREQ=Y in the lsf.conf file.
  • While running a job, a RHEL 7.2 server host may fail with the following error messages in the system log file or the system console:
    INFO: rcu_sched self-detected stall on CPU { number}
    INFO: rcu_sched detected stalls on CPUs/tasks:
    BUG: soft lockup - CPU#number stuck for time! [res:16462]

    This is an issue with RHEL 7.2 kernel-3.10.0-327.el7. To resolve this issue, download and apply a RHEL kernel security update. For more details, refer to https://rhn.redhat.com/errata/RHSA-2016-2098.html.
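
Several of the issues above depend on conditions that can be checked on a host before you start or update the cluster. The following is a minimal sketch of such a check, not an LSF-supplied tool: the function names (lsf_conf_value, is_affected_kernel, check_lib) are hypothetical, the /usr/lib64 library paths are taken from the commands above and might differ on your system, and the /etc fallback for LSF_ENVDIR is an assumption. It only reads files and prints warnings; it does not run any LSF commands.

```shell
#!/bin/sh
# Illustrative pre-flight checks for some of the known issues listed above.
# All function names here are hypothetical helpers, not part of LSF.

# Print the value of a NAME=value parameter from an lsf.conf-style file.
# Prints nothing if the parameter is undefined or commented out.
# $1 = parameter name, $2 = path to the configuration file
lsf_conf_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1 | tr -d '"'
}

# Return 0 if the kernel release string is the RHEL 7.2 kernel named above.
# $1 = kernel release, for example the output of `uname -r`
is_affected_kernel() {
    case "$1" in
        3.10.0-327.el7*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether a library file (or a link to it) exists.
# $1 = path to the library
check_lib() {
    if [ -e "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

# The /etc fallback is an assumption; set LSF_ENVDIR for your installation.
conf="${LSF_ENVDIR:-/etc}/lsf.conf"

if [ -r "$conf" ]; then
    if [ -z "$(lsf_conf_value LSF_STRICT_CHECKING "$conf")" ]; then
        echo "warn: LSF_STRICT_CHECKING is not defined; drain blaunch jobs before updating"
    fi
    if [ "$(lsf_conf_value LSF_STRICT_RESREQ "$conf")" = "N" ]; then
        echo "warn: LSF_STRICT_RESREQ=N can cause jobs to pend on AIX hosts"
    fi
else
    echo "note: $conf is not readable; skipping lsf.conf checks"
fi

# Library paths below match the commands shown above; adjust as needed.
check_lib /usr/lib64/libdcgm.so
check_lib /usr/lib64/libnsl.so.1

if is_affected_kernel "$(uname -r)"; then
    echo "warn: this kernel matches 3.10.0-327.el7; apply the RHEL kernel update"
fi
```

The checks are deliberately read-only so that the script can be run by any user on any host. Whether a warning actually applies still depends on the host's role: for example, the libdcgm.so check matters only on hosts that use the DCGM integration.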