Update IBM Spectrum LSF Suite for HPC that is locally installed on a single host.
Before you begin
IBM Spectrum LSF Suite for HPC must be installed locally with one management host. For more
details, refer to Determining the cluster configuration.
Choose a time when your cluster is not running any jobs, such as during a scheduled
maintenance window, or quiesce the cluster to minimize the number of running jobs while you
update the cluster.
About this task
Because there is only one management host, there are no
separate hosts on which you can test the Fix Pack. Therefore, the Fix Pack applies to the entire
live cluster.
Procedure
- Download the Fix Pack from IBM Fix Central.
- Log in to the LSF management host.
- Back up the deployer files.
For local installations, the existing IBM Spectrum LSF Suite for HPC installation RPM files for the deployer are necessary in case there are problems
with the Fix Pack and you need to roll back to the previous version.
For example, back up the contents of the /opt/ibm/lsf_installer/playbook
directory:
cd /opt/ibm/lsf_installer
tar zcvf lsf-rpm-backup.tgz playbook/
- Back up the contents of the LSF work and conf directories.
cd /opt/ibm/lsfsuite/lsf/
tar zcvf lsf-conf-backup.tgz conf/
tar zcvf lsf-work-backup.tgz work/
Note: New job submissions change the work directory. Jobs that are submitted after you back up
the work directory are therefore not recorded in the backup.
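As a sanity check, you can list the contents of each backup archive to confirm that it is readable. The following sketch demonstrates the check on a scratch directory rather than the live LSF directories; the paths and file contents are illustrative only:

```shell
# Sketch only: uses a scratch copy, not the live /opt/ibm/lsfsuite/lsf tree.
rm -rf /tmp/lsf-backup-demo && mkdir -p /tmp/lsf-backup-demo/conf
echo "LSF_TOP=/opt/ibm/lsfsuite/lsf" > /tmp/lsf-backup-demo/conf/lsf.conf
cd /tmp/lsf-backup-demo
tar zcf lsf-conf-backup.tgz conf/
# "tar ztf" lists the archive without extracting it; an error here means
# the backup would be unusable for a rollback.
tar ztf lsf-conf-backup.tgz
```

Run the same listing against the real lsf-conf-backup.tgz and lsf-work-backup.tgz files before you proceed with the update.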
- Back up Elasticsearch.
As of version 10.2 Fix Pack 10, Elasticsearch,
Logstash, and Kibana (ELK) are no longer bundled with the installation package. If you want to
use a newer version or specific Elasticsearch, Kibana, or Logstash features, download and install
them separately. Otherwise, you can continue to use the ELK package that was installed with 10.2
Fix Pack 9.
Note: The supported ELK version for version 10.2 Fix Pack 10 is 7.2.x or higher (but less than
version 8). IBM Spectrum LSF Suite for HPC 10.2 Fix Pack 10 was fully tested on ELK 7.2.1.
See Installing Elasticsearch, Kibana, and Logstash for instructions on installing an external version
of Elasticsearch and configuration requirements for upgrading from a previous version of IBM Spectrum LSF Suite for HPC using a bundled version of Elasticsearch.
Updating Elasticsearch re-indexes the current indices, so it is strongly recommended that you
back up your data before proceeding. On configurations with multiple Elasticsearch nodes, the
backup directory must be mounted on each node by using NFS; creating the snapshot then writes
each node's backup to the NFS directory. For more details, refer to
https://www.elastic.co/guide/en/elasticsearch/reference/6.6/modules-snapshots.html.
Note: The default ES_PORT is 9200.
- Log in to every GUI_Role machine as root.
- Configure the Elasticsearch snapshot repository.
- If there is only one GUI_Role machine, put the snapshot repository on a local disk.
- Create the directory /opt/ibm/elastic/elasticsearch_repo with write and execute permission for lsfadmin.
- In /opt/ibm/elastic/elasticsearch/config/elasticsearch.yml, set the repository path
(path.repo) to /opt/ibm/elastic/elasticsearch_repo.
- If there are multiple GUI_Role machines, the snapshot repository MUST be on a shared file system
(NFS) that all GUI_Role machines can access.
- On each GUI_Role machine, define the same shared location. Create a directory
[share_dir]/elasticsearch_repo with write and execute permission for
lsfadmin. For example: /mnt/elasticsearch_repo
- In /opt/ibm/elastic/elasticsearch/config/elasticsearch.yml, set the repository path
(path.repo) to the shared directory (for example, /mnt/elasticsearch_repo).
- Restart Elasticsearch on each GUI_Role machine to make the above changes take effect:
systemctl restart elasticsearch-for-lsf.service
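The elasticsearch.yml change described above can be sketched as follows. path.repo is the standard Elasticsearch setting that whitelists snapshot repository directories; the sketch writes to a demo file, and on a real host you would edit /opt/ibm/elastic/elasticsearch/config/elasticsearch.yml instead:

```shell
# Demo copy of the config file; on a GUI_Role machine, edit
# /opt/ibm/elastic/elasticsearch/config/elasticsearch.yml instead.
CONF=/tmp/elasticsearch-demo.yml
: > "$CONF"
# path.repo lists every directory that Elasticsearch may use as a snapshot
# repository. With multiple GUI_Role machines, this must be the shared
# (NFS) directory that all of them mount.
echo 'path.repo: ["/opt/ibm/elastic/elasticsearch_repo"]' >> "$CONF"
cat "$CONF"
```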
- Stop the following services on each GUI_Host machine:
perfadmin stop all
pmcadmin stop
systemctl stop logstash-for-lsf.service
systemctl stop metricbeat-for-lsf.service
systemctl stop filebeat-for-lsf.service
- Log in to a GUI_Role machine.
- Create the snapshot repository es_backup in Elasticsearch. At a command prompt, enter the
following command, replacing es_backup_location with the repository directory that you
configured:
curl -XPUT "[GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup" -H 'Content-Type: application/json' -d '{"type": "fs","settings": {"location": "es_backup_location","include_global_state": true,"compress": true}}'
- Create a snapshot named data_backup:
curl -XPOST [GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup/data_backup?wait_for_completion=true -H 'Content-Type: application/json' -d '{ "indices": "lsf*,mo*,ibm*", "ignore_unavailable": true, "include_global_state": false }'
- Check the status of the snapshot:
curl -XGET [GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup/data_backup?pretty
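A finished snapshot reports "state" : "SUCCESS" in the JSON response. The sketch below shows one way to check that field; a canned response stands in for the live curl output so that the check itself can be demonstrated:

```shell
# In a real run, capture the response with:
#   RESPONSE=$(curl -s "[GUI_ROLE machine IP]:9200/_snapshot/es_backup/data_backup?pretty")
# A canned response (illustrative) is used here instead.
RESPONSE='{ "snapshots" : [ { "snapshot" : "data_backup", "state" : "SUCCESS" } ] }'
if echo "$RESPONSE" | grep -q '"state" : "SUCCESS"'; then
  echo "snapshot data_backup completed"
else
  echo "snapshot not complete; inspect the response" >&2
fi
```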
- Restart the services on each GUI_Host machine:
perfadmin start all
pmcadmin start
systemctl start logstash-for-lsf.service
systemctl start metricbeat-for-lsf.service
systemctl start filebeat-for-lsf.service
- From the Fix Pack downloaded in step 1, run the suite_fix.bin or
suite_fixpack.bin file on the deployment host.
- From the /opt/ibm/lsf_installer/playbook directory, run
the installation with the lsf-upgrade.yml playbook to update your cluster with
the Fix Pack.
ansible-playbook -i lsf-inventory lsf-upgrade.yml
This playbook shuts down the LSF daemons, updates and rebuilds the contents of the shared
directory, then restarts the LSF daemons.
Important:
By default, parameter changes in lsf-config.yml are not applied when you run
lsf-upgrade.yml. If you made any parameter changes in lsf-config.yml, run
lsf-upgrade.yml with the external variable force_run_deploy=Y, which runs
lsf-upgrade.yml and lsf-deploy.yml sequentially.
ansible-playbook -i lsf-inventory lsf-upgrade.yml -e force_run_deploy=Y
Note that running lsf-upgrade.yml with force_run_deploy=Y requires more time than
the default command (that is, without the external variable set).
- Run some commands to verify the update.
- Run the lsid command to see your cluster name and management host name.
- Run the lshosts command to see the LSF management host. The
LSF server hosts and client hosts are also listed.
- Run the bhosts command to check that the status of each host is
ok, and that the cluster is ready to accept work.
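The bhosts check can be scripted. The sketch below scans the status column for anything other than ok; a canned two-host sample (with illustrative host names) stands in for live bhosts output:

```shell
# In a real run, use:  bhosts | tail -n +2   (skips the header line).
# A canned sample with illustrative host names stands in here.
BHOSTS_OUTPUT='host1 ok - 8 0 0 0 0 0
host2 ok - 8 0 0 0 0 0'
# Column 2 of bhosts output is the host status.
bad=$(echo "$BHOSTS_OUTPUT" | awk '$2 != "ok" {print $1}')
if [ -z "$bad" ]; then
  echo "all hosts ok"
else
  echo "hosts not ok: $bad"
fi
```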
- Test the cluster to evaluate the Fix Pack.
Troubleshooting: If the Fix Pack is not working correctly, contact IBM Support for
assistance or revert your cluster to its prior state.
To revert your cluster to its prior state, shut down the cluster before reverting the files from
the backups.
- Log in to the deployment host.
- Revert the contents of the deployment (that is, the YUM repository) in the
/var/www/html/lsf_suite_pkgs directory.
- Log in to the LSF management host and rebuild the YUM cache.
yum clean all
- Shut down the LSF management host.
systemctl stop lsfd
- Get a list of the LSF management host RPM packages.
rpm -qa | grep lsf | grep '10.2.0' | grep -v lsf-conf
- Navigate to the deployer directory for your architecture (for example,
/var/www/html/lsf_suite_pkgs/{arch}) and find the old
version numbers for the RPM packages.
- For each RPM package, restore the package to the old version.
yum downgrade {package_name}-{old_version}
- Restart the LSF management host.
systemctl restart lsfd
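The per-package downgrade in the rollback steps above can be sketched as a loop. The package names and old version below are illustrative, and the loop echoes each command for review rather than running it:

```shell
# A real run would build the list with:
#   rpm -qa | grep lsf | grep '10.2.0' | grep -v lsf-conf
# Canned package names (illustrative) stand in here.
pkgs="lsf-server-10.2.0.10-1.x86_64 lsf-client-10.2.0.10-1.x86_64"
for pkg in $pkgs; do
  name=${pkg%%-10.2.0*}          # strip the version suffix -> package name
  # Echo the command for review; remove "echo" to actually downgrade.
  echo "yum downgrade ${name}-10.2.0.9"
done
```

Substitute the old version numbers that you found in the deployer directory for the illustrative 10.2.0.9 suffix.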
Troubleshooting: Restoring backup Elasticsearch data
To restore backed up Elasticsearch data, perform the following steps:
- Stop services on each GUI_Host machine:
perfadmin stop all
pmcadmin stop
systemctl stop logstash-for-lsf.service
systemctl stop metricbeat-for-lsf.service
systemctl stop filebeat-for-lsf.service
- To restore an index, first delete the index, then restore it from the snapshot:
curl -XDELETE [GUI_ROLE machine IP]:ES_PORT/[index_name]
curl -X POST "[GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup/data_backup/_restore" -H 'Content-Type: application/json' -d' { "indices": "index_name*", "ignore_unavailable": true, "include_global_state": true }'
For example, to restore the lsf_events* indices:
curl -XDELETE [GUI_ROLE machine IP]:ES_PORT/lsf_events*
curl -X POST "[GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup/data_backup/_restore" -H 'Content-Type: application/json' -d' { "indices": "lsf_events*", "ignore_unavailable": true, "include_global_state": true }'
- Restart the services on each GUI_Host machine:
perfadmin start all
pmcadmin start
systemctl start logstash-for-lsf.service
systemctl start metricbeat-for-lsf.service
systemctl start filebeat-for-lsf.service
- Clear browser data before logging in.