Updating IBM Spectrum LSF Suite for Enterprise on a shared file system

Update IBM Spectrum LSF Suite for Enterprise on a shared file system, which affects all hosts in the cluster.

Before you begin

IBM Spectrum LSF Suite for Enterprise must be installed on a shared file system. For more details, refer to Determining the cluster configuration.

About this task

Because all management hosts use the same binary files and configuration files in the shared file system, there are no separate hosts on which you can test the Fix Pack. Therefore, the Fix Pack applies to the entire live cluster.

Procedure

  1. Download the Fix Pack from IBM Fix Central.

    For more details, refer to Getting fixes from IBM Fix Central.

  2. Log in to the deployer host.
  3. Back up the contents of the shared directory.

    For an installation on a shared file system, there is no separate host on which you can test the Fix Pack, which means that any updates are applied to the live cluster. Back up the contents in case there are problems with the Fix Pack and you need to roll back to the previous version.

    Navigate to your shared LSF directory and archive the files.

    If your LSF shared directory is /share/lsf, run the following command to back up the shared directory:

    cd /share/lsf ; tar zcvf lsf-backup.tgz *
  4. Back up Elasticsearch.

    As of version 10.2 Fix Pack 10, Elasticsearch, Logstash, and Kibana (ELK) are no longer bundled with the installation package. Customers who want to use a newer version or specific Elasticsearch, Kibana, or Logstash features must download and install them separately. Otherwise, customers can continue to use the ELK package that was installed with 10.2 Fix Pack 9.

    Note: The supported ELK version for version 10.2 Fix Pack 10 is 7.2.x or higher (but less than version 8). IBM Spectrum LSF Suite for Enterprise 10.2 Fix Pack 10 was fully tested on ELK 7.2.1.

    See Installing Elasticsearch, Kibana, and Logstash for instructions on installing an external version of Elasticsearch, and for the configuration requirements when upgrading from a previous version of IBM Spectrum LSF Suite for Enterprise that uses a bundled version of Elasticsearch.
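
    If you are not sure which Elasticsearch version is currently running, you can query the Elasticsearch root endpoint as a quick check. The command below assumes that Elasticsearch is listening on the default port 9200 on the GUI host where you run it; the version.number field in the JSON response shows the running version.

      # assumes the default Elasticsearch port 9200 on the local GUI host
      curl -XGET "localhost:9200?pretty"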

    Updating Elasticsearch re-indexes the current indices, so it is strongly recommended that you back up your data before proceeding. On configurations with multiple Elasticsearch nodes, the backup directory must be mounted on each node using NFS; when the snapshot is created, each node writes its backup to that NFS directory. Refer to https://www.elastic.co/guide/en/elasticsearch/reference/6.6/modules-snapshots.html for more details.
    Note: The default ES_PORT is 9200.
    1. Log in to every GUI_Role machine as root.
    2. Configure the Elasticsearch snapshot repository.
      • If there is only one GUI_Role machine, put the snapshot repository on a local disk.
        1. Create the directory /opt/ibm/elastic/elasticsearch_repo with write and execute permission for lsfadmin.
        2. In /opt/ibm/elastic/elasticsearch/config/elasticsearch.yml, set the path.repo parameter to /opt/ibm/elastic/elasticsearch_repo.
      • If there are multiple GUI_Role machines, the snapshot repository MUST be on a shared file system (NFS) that all GUI_Role machines can access.
        1. On each GUI_Role machine, define the same shared location.

          Create a directory [share_dir]/elasticsearch_repo with write and execute permission for lsfadmin. For example: /mnt/elasticsearch_repo

        2. In /opt/ibm/elastic/elasticsearch/config/elasticsearch.yml, set the path.repo parameter to the shared repository directory, for example /mnt/elasticsearch_repo, as shown in the sketch below.
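
        The following is a minimal sketch of that setup, run as root on each GUI_Role machine. It assumes that the NFS export is already mounted at /mnt/elasticsearch_repo (the mount point and the lsfadmin owner are taken from the steps above), and that path.repo is not already defined in elasticsearch.yml; if it is, edit the existing line instead of appending a duplicate. For a single GUI_Role machine, the same commands apply with /opt/ibm/elastic/elasticsearch_repo as the directory.

          # create the repository directory with write and execute permission for lsfadmin
          mkdir -p /mnt/elasticsearch_repo
          chown lsfadmin /mnt/elasticsearch_repo
          chmod u+rwx /mnt/elasticsearch_repo
          # allow Elasticsearch to use the directory as a snapshot repository path
          echo 'path.repo: /mnt/elasticsearch_repo' >> /opt/ibm/elastic/elasticsearch/config/elasticsearch.yml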
    3. On each GUI_Role machine, restart Elasticsearch so that these changes take effect:
      systemctl restart elasticsearch-for-lsf.service
    4. Stop the following services on each GUI_Host machine:
      perfadmin stop all
      pmcadmin stop
      systemctl stop logstash-for-lsf.service
      systemctl stop metricbeat-for-lsf.service
      systemctl stop filebeat-for-lsf.service
      
    5. Log in to a GUI_Role machine.
    6. Register a snapshot repository named es_backup in Elasticsearch, where es_backup_location is the repository directory that you configured in step 2. At a command prompt, enter the following command:
      curl -XPUT "[GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup" -H 'Content-Type: application/json' -d '{"type": "fs","settings": {"location": "es_backup_location","include_global_state": true,"compress": true}}'
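
      As an illustration only, with assumed values of 10.0.0.1 for the GUI_Role machine IP, the default port 9200, and /mnt/elasticsearch_repo as the repository directory, the command looks like the following (location is the only mandatory setting for an fs repository):

      curl -XPUT "10.0.0.1:9200/_snapshot/es_backup" -H 'Content-Type: application/json' -d '{"type": "fs","settings": {"location": "/mnt/elasticsearch_repo","compress": true}}'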
      
    7. Create a snapshot named data_backup in the es_backup repository:
      curl -XPOST "[GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup/data_backup?wait_for_completion=true" -H 'Content-Type: application/json' -d '{ "indices": "lsf*,mo*,ibm*", "ignore_unavailable": true, "include_global_state": false }'
      
    8. Check the status of the snapshot and confirm that its state is SUCCESS:
      curl -XGET "[GUI_ROLE machine IP]:ES_PORT/_snapshot/es_backup/data_backup?pretty"
      
    9. Restart the services on each GUI_Host machine:
      perfadmin start all
      pmcadmin start
      systemctl start logstash-for-lsf.service
      systemctl start metricbeat-for-lsf.service
      systemctl start filebeat-for-lsf.service
      
  5. From the Fix Pack downloaded in step 1, run the suite_fix.bin or suite_fixpack.bin file on the deployer host.
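
    For example, a minimal sketch, assuming that the Fix Pack file was downloaded to /tmp on the deployer host (the actual file name depends on the Fix Pack that you downloaded):

    # make the self-extracting installer executable, then run it
    cd /tmp
    chmod +x suite_fixpack.bin
    ./suite_fixpack.bin
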
  6. From the /opt/ibm/lsf_installer/playbook directory on the deployer host, run the installation with the lsf-upgrade.yml playbook to update your cluster with the Fix Pack.
    ansible-playbook -i lsf-inventory lsf-upgrade.yml

    This playbook shuts down the LSF daemons, updates and rebuilds the contents of the shared directory, then restarts the LSF daemons.

    Important:

    By default, any parameter changes in lsf-config.yml are not applied when you run lsf-upgrade.yml.

    Therefore, if you made any parameter changes in lsf-config.yml, run lsf-upgrade.yml with the extra variable force_run_deploy=Y, which runs lsf-upgrade.yml and lsf-deploy.yml sequentially:

    ansible-playbook -i lsf-inventory lsf-upgrade.yml -e force_run_deploy=Y

    Note that running lsf-upgrade.yml with force_run_deploy=Y takes longer than running it without the extra variable.

  7. Run some commands to verify the update.
    1. Log out of the deployer host, and log in to a host in the cluster.
    2. Run the lsid command to see your cluster name and management host name.
    3. Run the lshosts command to see the LSF management hosts (they are members of the management group indicated by the mg resource). The LSF server hosts and client hosts are also listed.
    4. Run the bhosts command to check that the status of each host is ok, and the cluster is ready to accept work.
    5. Log in to one of the server hosts to check that it is using the shared directory.
      For example,
      # ssh hosta1
      # cd /opt/ibm/lsf_suite
      # ls
      ext  lsf
      # ls -al 
      total 0
      drwxr-xr-x. 3 lsfadmin root 28 Nov   2 13:26 .
      drwxr-xr-x. 6 root     root 92 Nov   2 13:26 ..
      drwxr-xr-x. 2 lsfadmin root  6 Nov   2 13:26 ext
      lrwxrwxrwx. 1 root     root 39 Nov   2 13:26 lsf -> /gpfs/lsf_suite/lsf
      The symbolic link shows that the lsf directory comes from the shared directory /gpfs/lsf_suite/lsf.
  8. Test the cluster to evaluate the Fix Pack.
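
    For example, you can submit a short test job and confirm that it is dispatched and completes (the sleep 60 workload is only a placeholder):

    bsub sleep 60
    bjobs
    # after about a minute, confirm that the job shows as DONE
    bjobs -a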

Troubleshooting: If the Fix Pack is not working correctly, contact IBM Support for assistance or revert your cluster to its prior state.

To revert your cluster to its prior state, shut down the cluster before reverting the files from the backups.

  1. Shut down the LSF cluster.
    ansible all -i lsf-inventory -m command -a "systemctl stop lsfd"
  2. Back up the lsb.events and lsb.acct files from the shared directory.

    This ensures that your cluster retains information on any new jobs that were submitted while your cluster was using the new Fix Pack.

    For example, if your LSF shared directory is /share/lsf and your cluster name is myCluster, run the following command to back up the lsb.events and lsb.acct files:

    cd /share/lsf/myCluster/logdir ; tar zcvf ../../lsf-backup-logs.tgz lsb.acct lsb.events
  3. Restore the contents of the shared directory from the backups.

    For example, if your LSF shared directory is /share/lsf and your cluster name is myCluster, run the following commands to revert the shared directory from the backups:

    cd /share/lsf ; tar zxvf lsf-backup.tgz ; tar zxvf lsf-backup-logs.tgz -C myCluster/logdir
  4. Restart the LSF cluster.
    ansible all -i lsf-inventory -m command -a "systemctl start lsfd"
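
    To confirm that the LSF daemons came back up on every host, you can check the service state through the same inventory (a quick sketch using the lsfd service name from the commands above):

    ansible all -i lsf-inventory -m command -a "systemctl is-active lsfd"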

Troubleshooting: Restoring backup Elasticsearch data

To restore backed up Elasticsearch data, perform the following steps:
  1. Stop services on each GUI_Host machine:
    perfadmin stop all
    pmcadmin stop
    systemctl stop logstash-for-lsf.service
    systemctl stop metricbeat-for-lsf.service
    systemctl stop filebeat-for-lsf.service
  2. To restore an index, first delete the existing index, then restore it from the snapshot:
    curl -XDELETE [GUI_ROLE_machine_IP]:ES_PORT/[index_name]
    curl -X POST "[GUI_ROLE_machine_IP]:ES_PORT/_snapshot/es_backup/data_backup/_restore" -H 'Content-Type: application/json' -d' { "indices": "index_name*", "ignore_unavailable": true, "include_global_state": true }'
    For example, to restore the lsf_events* indices:
    curl -XDELETE [GUI_ROLE_machine_IP]:ES_PORT/lsf_events*
    curl -X POST "[GUI_ROLE_machine_IP]:ES_PORT/_snapshot/es_backup/data_backup/_restore" -H 'Content-Type: application/json' -d' { "indices": "lsf_events*", "ignore_unavailable": true, "include_global_state": true }'
  3. Restart the services on each GUI_Host machine:
    perfadmin start all
    pmcadmin start
    systemctl start logstash-for-lsf.service
    systemctl start metricbeat-for-lsf.service
    systemctl start filebeat-for-lsf.service
  4. Clear browser data before logging in.