Resource connector enhancements

The following enhancements affect LSF resource connector.

View resource connector information with the badmin command

LSF now has the badmin rc view subcommand, which displays LSF resource connector information, and the badmin rc error subcommand, which displays error messages from the host providers. To get the error messages, the third-party mosquitto message queue application must be running on the host.

Specify the maximum number of error messages that badmin rc error displays for each host provider by defining the LSB_RC_MQTT_ERROR_LIMIT parameter in the lsf.conf file.
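The effect of a per-provider limit can be pictured with a short Python sketch. It is illustrative only and is not how LSF implements the limit: the function name, the (provider, text) message structure, and the choice to keep the most recent messages are all assumptions made for this example.

```python
from collections import defaultdict


def limit_errors_per_provider(messages, limit):
    """Keep at most `limit` messages per provider (hypothetical helper).

    `messages` is an iterable of (provider, text) pairs in arrival order.
    Keeping the newest messages is an assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for provider, text in messages:
        grouped[provider].append(text)
    # Slice from the end so that only the last `limit` entries survive.
    # A limit of 0 is handled separately because texts[-0:] returns everything.
    return {
        provider: (texts[-limit:] if limit > 0 else [])
        for provider, texts in grouped.items()
    }
```

In this sketch, the `limit` argument plays the role of the LSB_RC_MQTT_ERROR_LIMIT value.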

Dynamic resource snapshot

LSF resource connector can now query any host provider API for dynamic resources at each snapshot interval. This allows LSF to natively handle problems with resource availability on resource providers.

In addition, if the ebrokerd daemon encounters certain provider errors while querying the host provider API for dynamic resources, LSF introduces a delay before ebrokerd repeats the request. Specify the new LSB_RC_TEMPLATE_REQUEST_DELAY parameter in the lsf.conf file to define this delay, in minutes.
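The delay-then-repeat behavior described above can be sketched in Python as follows. This is a simplified illustration, not the ebrokerd implementation: the ProviderError class, the query callable, and the max_attempts cap are hypothetical, and only the idea of a delay expressed in minutes comes from the text.

```python
import time


class ProviderError(Exception):
    """Hypothetical stand-in for a provider error that warrants a delay."""


def query_with_delay(query, delay_minutes, max_attempts=3, sleep=time.sleep):
    """Call `query()`; after a ProviderError, wait before repeating the request.

    `delay_minutes` plays the role of LSB_RC_TEMPLATE_REQUEST_DELAY.
    `max_attempts` is an assumption of this sketch, not an LSF parameter.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except ProviderError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay_minutes * 60)  # convert minutes to seconds
```

The sleep function is passed in as a parameter so that the delay can be replaced by a stub when the sketch is tested.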

Host management with instance IDs

LSF resource connector can now pass the instance ID of provisioned hosts to LSF when the hosts join the cluster. This provides an additional way for the resource connector hosts to identify themselves to the LSF cluster, which improves the redundancy and fault tolerance for the resource connector hosts.

In addition, the bhosts -rc and bhosts -rconly command options now display the instance IDs of the provisioned hosts.

Improved fault tolerance

The LSF resource connector now has improved fault tolerance: CLOSED_RC hosts can be switched to ok more quickly, which wastes fewer resources. Instance ID, cluster name, and template name are added to the host local resources for more accurate information and better tracking. Instance ID is also added to the JOB_FINISH event in the lsb.acct file.

Google provider synchronization

The LSF resource connector Google provider plugin can now synchronize hosts between LSF and the cloud.

Configure timeout values for each host provider

The LSF resource connector now allows you to configure timeout values for each host provider. Specify the provHostTimeOut parameter for each provider in the hostProviders.json file. The default value is 10 minutes. If a resource connector host does not join the LSF cluster within this timeout value, the host is relinquished.
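As a rough illustration of how a per-provider timeout with a 10-minute default might be read and applied, consider the Python sketch below. The JSON layout in SAMPLE (a "providers" list with "name" entries) and both helper functions are assumptions made for this example. Only the provHostTimeOut parameter, its default, and the relinquish rule come from the text above.

```python
import json

DEFAULT_TIMEOUT_MINUTES = 10  # default value stated in the text

# Assumed file layout for illustration; only provHostTimeOut is from the text.
SAMPLE = """
{
  "providers": [
    {"name": "aws", "provHostTimeOut": 15},
    {"name": "azure"}
  ]
}
"""


def provider_timeouts(text):
    """Map each provider name to its join timeout in minutes."""
    providers = json.loads(text).get("providers", [])
    return {
        p["name"]: p.get("provHostTimeOut", DEFAULT_TIMEOUT_MINUTES)
        for p in providers
    }


def should_relinquish(minutes_since_request, timeout_minutes):
    """True if the host has not joined the cluster within the timeout."""
    return minutes_since_request > timeout_minutes
```

Here the hypothetical "azure" entry omits provHostTimeOut, so it falls back to the default of 10 minutes.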

Detect an NVIDIA sibling GPU under a PCI

LSF now enables lim and elim.gpu.topology to detect GPUs properly on hosts that have two sibling GPUs under one PCI when the first GPU is not an NVIDIA GPU but the second GPU is.
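The underlying idea, checking every sibling device rather than only the first, can be shown with a minimal Python sketch. It does not reflect how lim or elim.gpu.topology are implemented: the (vendor_id, device_name) tuples and the function name are hypothetical. The value 0x10DE is the PCI vendor ID assigned to NVIDIA.

```python
NVIDIA_VENDOR_ID = 0x10DE  # PCI vendor ID assigned to NVIDIA


def nvidia_gpus(siblings):
    """Return every NVIDIA device among sibling GPUs, wherever it appears.

    `siblings` is a list of (vendor_id, device_name) tuples, a hypothetical
    representation chosen for this sketch.
    """
    # Inspect every sibling instead of stopping at the first device.
    return [name for vendor, name in siblings if vendor == NVIDIA_VENDOR_ID]
```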

Note: For LSF resource connector on AWS, you must create an updated image, then use the new AMI ID to borrow the GPU instance from AWS.

Improved Azure support

The LSF resource connector now follows the official Azure documentation for HTTP/HTTPS proxies by calling the API from the Azure native library instead of the Java system library.

Candidate host group sorting

The LSF scheduler now sorts candidate host groups by template priority when multiple templates are defined in LSF resource connector. Previously, the order of these groups was undefined. LSF determines the template priority from the first host within the candidate host group.
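The sorting rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the scheduler's code: the host dictionaries, the template_priority mapping, and the convention that a larger number means higher priority are all hypothetical.

```python
def sort_candidate_groups(groups, template_priority):
    """Sort host groups by the template priority of each group's first host.

    `groups` is a list of non-empty lists of host dicts with a "template" key.
    `template_priority` maps a template name to a number. Treating a larger
    number as higher priority is an assumption of this sketch.
    """
    def group_priority(group):
        # Only the first host in the group determines the group's priority.
        return template_priority[group[0]["template"]]

    # sorted() is stable, so groups with equal priority keep their input order.
    return sorted(groups, key=group_priority, reverse=True)
```

Because only the first host is consulted, a group whose later hosts come from a different template is still ranked by the template of its first host.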