How LSF works with LSF_MASTER_LIST
The files lsf.shared and lsf.cluster.cluster_name are shared only among LIMs on hosts that are listed as candidates to be elected the management host with the parameter LSF_MASTER_LIST.
The preferred management host is no longer the first host in the cluster list in lsf.cluster.cluster_name, but the first host in the list specified by LSF_MASTER_LIST in lsf.conf.
Whenever you reconfigure, only LIMs on the management host candidates read lsf.shared and lsf.cluster.cluster_name to get updated information. The LIM on the elected management host sends configuration information to child LIMs on the server hosts.
The order in which you specify hosts in LSF_MASTER_LIST is the preferred order for selecting hosts to become the management host.
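As an illustration, the following lsf.conf fragment lists three management host candidates in order of preference. The host names hostA, hostB, and hostC are placeholders; substitute the names of your own candidate hosts.

```
# Placeholder host names, listed in order of preference
LSF_MASTER_LIST="hostA hostB hostC"
```

With this setting, hostA is the preferred management host, and hostB and hostC are considered in that order. The lsid command reports which host is currently the management host.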
Non-shared file considerations
Generally, the files lsf.cluster.cluster_name and lsf.shared for hosts that are management candidates should be identical.
When the cluster is started up or reconfigured, LSF rereads configuration files and compares lsf.cluster.cluster_name and lsf.shared for hosts that are management candidates.
In some cases in which identical files are not shared, files may be out of sync. This section describes situations that may arise should lsf.cluster.cluster_name and lsf.shared for hosts that are management candidates not be identical to those of the elected management host.
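One way to spot files that have drifted out of sync is to compare copies collected from each management candidate. The following is a minimal sketch, not an LSF utility: the function name same_config is hypothetical, and gathering the copies from each host (for example, with scp) is assumed to have been done already. It checks byte-for-byte equality, which is stricter than the load index count that LSF itself compares.

```shell
# Hypothetical helper: report whether two local copies of a
# configuration file are byte-identical.
# Usage: same_config FILE_FROM_HOST_1 FILE_FROM_HOST_2
same_config() {
    # cmp -s prints nothing and returns 0 only when the files match
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example, if copies of lsf.shared from two candidates were saved under the hypothetical paths /tmp/hostA/lsf.shared and /tmp/hostB/lsf.shared, running same_config on those two paths prints either identical or different.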
LSF_MASTER_LIST host eligibility
LSF rejects candidate management hosts listed in LSF_MASTER_LIST from the cluster only if the number of load indices in lsf.cluster.cluster_name or lsf.shared for those candidates differs from the number of load indices in the lsf.cluster.cluster_name or lsf.shared files of the elected management host.
A warning is logged in the log file lim.log.management_host_name and the cluster continues to run, but without the hosts that were rejected.
If you want the rejected hosts to be part of the cluster, ensure that the number of load indices in lsf.cluster.cluster_name and lsf.shared is identical for all management candidates, then restart LIMs on the management host and all management candidates:
bctrld restart lim hostA hostB hostC
Failover with ineligible management host candidates
If the elected management host goes down, and the number of load indices in lsf.cluster.cluster_name or lsf.shared for the newly elected management host differs from the number of load indices in the files of the management host that went down, LSF rejects all management candidates that do not have the same number of load indices in their files as the newly elected management host. LSF also rejects all server-only hosts. This can leave a cluster in which only the newly elected management host is considered part of the cluster.
A warning is logged in the log file lim.log.new_management_host_name and the cluster continues to run, but without the hosts that were rejected.
To resolve this, from the current management host, restart all LIMs:
lsadmin limrestart all
All server-only hosts will be considered part of the cluster. Candidate management hosts with a different number of load indices in their lsf.cluster.cluster_name or lsf.shared files will be rejected.
When the management host that was down comes back up, you need to ensure load indices defined in lsf.cluster.cluster_name and lsf.shared for all management candidates are identical and restart LIMs on all management candidates.