Upgrading to version 1.0.8

The upgrade to version 1.0.8 is performed by IBM Support.


Approximate upgrade times are provided for three system configuration levels:
  • Base
  • Base + 2
  • Base + 8
Table 1. Cloud Pak for Data System upgrade time
+-----------------+-------------+--------------------------+
| System type     | Dell/Lenovo | Approximate upgrade time |
+-----------------+-------------+--------------------------+
| Base system     | Lenovo      | 4 hours                  |
| Base + 2 system | Dell        | 3 hours                  |
| Base + 2 system | Lenovo      | 4 hours 30 minutes       |
| Base + 8 system | Lenovo      | 4 hours 30 minutes       |
+-----------------+-------------+--------------------------+
The average upgrade time is approximately 4 hours 30 minutes.
Note: The upgrade might take longer for larger deployments.

Before you begin

Upgrade prerequisites:
  • If you are running Cloud Pak for Data System version 1.0.7.8, before you start upgrading to 1.0.8, you must apply 1.0.7.8 Interim Fix 2. For more information, see the 1.0.7.8 Interim Fix 2 release notes.
Network setup prerequisites:
  • Before you start the upgrade, from /opt/ibm/appliance/platform/apos-comms/customer_network_config/ansible directory, you must run:
    ANSIBLE_HASH_BEHAVIOUR=merge ansible-playbook -i ./System_Name.yml playbooks/house_config.yml --check -v
    If any changes are listed in the --check -v output, ensure that they are expected. If any changes are unexpected, edit the YAML file so that it contains only the expected changes. You can rerun this command as necessary until you see no errors.

Netezza prerequisites:

About this task

Upgrading to 1.0.8 is supported only for systems on version 1.0.7.6 and above.
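Before you start, you can confirm the installed version from the node1 management session. This is a minimal check, assuming the ap command is available in your path as on the control nodes:
  # The reported appliance software version must be 1.0.7.6 or later
  ap version -s | head -n 1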
Only the system bundle is upgraded in 1.0.8. There is no need to download the following packages:
  • icpds_vm
  • icpds_rhos_repo
  • icpds_services
  • icpds_services_addon_cyclops

Procedure

  1. Connect to the node1 management interface by using the custom_hostname or ip values from your System_Name.yml file.
  2. Verify that e1n1 is the hub (see the combined sketch after this procedure):
    1. Check for the hub node by verifying that the dhcpd service is running:
      systemctl is-active dhcpd
    2. If the dhcpd service is running on a node other than e1n1, bring the service down on that other node:
      systemctl stop dhcpd
    3. On e1n1, run:
      systemctl start dhcpd
  3. Download the system bundle from Fix Central and copy it to /localrepo on e1n1.
    Note: The upgrade bundle requires a significant amount of free space. Make sure you delete all bundle files from previous releases.
  4. From the /localrepo directory on e1n1, run:
    mkdir 1.0.8.0_release
    and move the system bundle into that directory (see the sketch after this procedure). The directory must have a unique name; no previous upgrade on the system can have been run from a directory with the same name.
  5. Run:
    apupgrade --upgrade-apupgrade --upgrade-directory /localrepo --use-version 1.0.8.0_release --bundle system
    The value for the --use-version parameter is the same as the name of the directory you created in step 4.
  6. Run apupgrade with the --upgrade-details option to view details about the specific upgrade version:
    apupgrade --upgrade-details --upgrade-directory /localrepo --use-version 1.0.8.0_release --bundle system

    To help you work through potential upgrade issues, you can find a text file with the applicable workarounds in the following location: /localrepo/1.0.8.0_release/EXTRACT/system/Workarounds-1.0.8.0.txt

  7. Before you start the upgrade process, depending on your requirements:
    • Run the preliminary checks with the --preliminary-check option if you want only to check for potential issues and cannot accept any system disruption:
      apupgrade --preliminary-check --upgrade-directory /localrepo --use-version 1.0.8.0_release --bundle system
      This check is non-invasive and you can rerun it as necessary. You can expect the following output after the preliminary checks complete:
      All preliminary checks complete
      Finished running pre-checks.
    • Optional: Run the preliminary checks with the --preliminary-check-with-fixes option if you want to check for potential issues and attempt to fix them automatically:
      apupgrade --preliminary-check-with-fixes --upgrade-directory /localrepo --use-version 1.0.8.0_release --bundle system
      Run this option only if you can accept system disruption, because the command might cause the nodes to reboot.
  8. Run:
    apupgrade --upgrade --upgrade-directory /localrepo --use-version 1.0.8.0_release --bundle system
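The individual commands in steps 2 through 5 can be combined into one short shell session on e1n1. The following is a minimal sketch, not a verified script; the bundle file name cpds-1.0.8.0-system-bundle.tar is a hypothetical placeholder for the file that you downloaded from Fix Central:
  # Step 2: confirm that dhcpd (the hub service) is active on e1n1 only
  for node in `/opt/ibm/appliance/platform/xcat/scripts/xcat/display_nodes.py`; do echo ${node}; ssh ${node} systemctl is-active dhcpd; done
  # If dhcpd is active on another node, stop it there and start it on e1n1:
  #   ssh <other_node> systemctl stop dhcpd
  #   ssh e1n1 systemctl start dhcpd

  # Steps 3 and 4: stage the bundle in a uniquely named directory under /localrepo
  mkdir /localrepo/1.0.8.0_release
  mv /localrepo/cpds-1.0.8.0-system-bundle.tar /localrepo/1.0.8.0_release/

  # Step 5: update apupgrade itself from the staged bundle
  apupgrade --upgrade-apupgrade --upgrade-directory /localrepo --use-version 1.0.8.0_release --bundle system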

Results

After the upgrade is complete, some of the following alerts might be opened in the system:
| 439         | SW_NEEDS_ATTENTION         | SW    | Openshift node is not ready                                   | YES      |
| 440         | SW_NEEDS_ATTENTION         | SW    | Openshift service is not ready                                | YES      |
| 446         | SW_NEEDS_ATTENTION         | SW    | ICP4D service is not ready                                    | YES      |
| 451         | SW_NEEDS_ATTENTION         | SW    | Webconsole service is not ready                               | YES      |
| 460         | SW_NEEDS_ATTENTION         | SW    | Portworx component is not healthy                             | YES      |
Close them manually with the following command:
ap issues --close <alert_id>
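For example, assuming that running ap issues with no arguments lists the currently open alerts, closing alert 439 might look like this:
ap issues
ap issues --close 439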
As part of the upgrade process, the VMs on all nodes are disabled and shut down. They are expected to remain in the shut off state in 1.0.8.
[root@gt01-node1 ~]# for node in `/opt/ibm/appliance/platform/xcat/scripts/xcat/display_nodes.py`; do echo ${node}; ssh $node virsh list --all; done
e1n1
 Id    Name                           State
----------------------------------------------------
 -     e1n1-1-control                 shut off

e1n2
 Id    Name                           State
----------------------------------------------------
 -     e1n2-1-control                 shut off

e1n3
 Id    Name                           State
----------------------------------------------------
 -     e1n3-1-control                 shut off

e1n4
 Id    Name                           State
----------------------------------------------------
 -     e1n4-1-worker                  shut off

e2n1
e2n2
e2n3
e2n4
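To confirm that no control or worker VMs were left running on any node, you can reuse the same loop and list only running domains. A minimal sketch, assuming your libvirt version supports the --state-running and --name filters:
for node in `/opt/ibm/appliance/platform/xcat/scripts/xcat/display_nodes.py`; do echo ${node}; ssh ${node} virsh list --state-running --name; done
The expected output is the node names only, with no VM names listed underneath.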
In 1.0.8, the Netezza web console runs on one of the three control nodes (or on a connector node if installed). There are two docker containers required for operation of the Netezza console: cyclops and the associated influxdb container. Container images are installed on all control nodes for high availability. When a control node goes out of service, Platform Manager starts the cyclops and influxdb containers on another control node (or a connector node).
[root@gt18-node1 ~]# for node in `/opt/ibm/appliance/platform/xcat/scripts/xcat/display_nodes.py --control`; do echo ${node}; ssh $node docker ps -a | grep -E 'cyclops|influxdb'; done
e1n1
c7d402b47de8        cyclops:4.0.2-20221114b30631-x86_64   "/scripts/start.sh"      4 days ago          Exited (255) 30 hours ago                            cyclops
9f960b843510        influxdb:latest                       "/entrypoint.sh in..."   4 days ago          Exited (255) 30 hours ago   0.0.0.0:8086->8086/tcp   influxdb
e1n2
642f4b0b5087        cyclops:4.0.2-20221114b30631-x86_64   "/scripts/start.sh"      4 days ago          Up 30 hours         80/tcp, 3000/tcp, 5480/tcp, 0.0.0.0:3333->3333/tcp, 0.0.0.0:8843->8443/tcp   cyclops
177a97aaa701        influxdb:latest                       "/entrypoint.sh in..."   4 days ago          Up 30 hours         0.0.0.0:8086->8086/tcp                                                       influxdb
e1n3
d590a49d369f        cyclops:4.0.2-20221114b30631-x86_64   "/scripts/start.sh"      4 days ago          Exited (137) 4 days ago                       cyclops
19e5f305548e        influxdb:latest                       "/entrypoint.sh in..."   4 days ago          Exited (0) 4 days ago                         influxdb
[root@gt18-node1 ~]# 
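To see only the control node that is currently running the console containers, you can drop the -a flag so that only containers in the Up state are listed. A minimal variation of the command above:
for node in `/opt/ibm/appliance/platform/xcat/scripts/xcat/display_nodes.py --control`; do echo ${node}; ssh ${node} "docker ps | grep -E 'cyclops|influxdb'"; done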
The cyclops entry in the ap version -s output reflects the web console version:
[root@gt18-node1 ~]# ap version -s
Appliance software version is 1.0.8.0

All component versions are synchronized.

+-----------------------------+-----------------------------------------------------------------+
| Component Name              | Version                                                         |
+-----------------------------+-----------------------------------------------------------------+
| Appliance platform software | 1.0.8.0-20221130100627b31176                                    |
| aposcomms                   | ibm-apos-named-config               : 1.0.5.1-1                 |
|                             | ibm-apos-network-tools              : 2.0.4.0-1                 |
|                             | ibm-apos-common                     : 1.0.8.2-1                 |
|                             | ibm-apos-udev-rules-config          : 1.0.4.0-1                 |
|                             | ibm-apos-chrony-config              : 1.0.1.0-3                 |
|                             | ibm-apos-dhcpd-config               : 1.0.4.1-1                 |
|                             | ibm-apos-network-config             : 1.1.9.0-1                 |
| apupgrade                   | 1.0.8.0-20221128071127b31129                                    |
| callhome                    | 0.1.0.0                                                         |
| containerapi                | 1.0.23.0-20221103134948b30181                                   |
| cyclops                     | 4.0.2-20221114b30631                                            |
| docker-upgrade              | oci-systemd-hook                    : 0.2.0-1                   |
|                             | oci-umount                          : 2.5-3                     |
|                             | oci-register-machine                : 0-6                       |
|                             | atomic-registries                   : 1.22.1-29                 |
|                             | docker-common                       : 1.13.1-161                |
|                             | docker-client                       : 1.13.1-161                |
|                             | docker-debuginfo                    : 1.13.1-161                |
|                             | docker                              : 1.13.1-161                |
|                             | docker-rhel-push-plugin             : 1.13.1-161                |
|                             | container-storage-setup             : 0.11.0-2                  |
|                             | container-selinux                   : 2.119.2-1.911c772         |
|                             | containers-common                   : 0.1.40-11                 |
|                             | python-pytoml                       : 0.1.14-1                  |
| gpfs                        | 5.1.2-1                                                         |
| gpfsconfig                  | 1.0.8.0-20221130082401b31174                                    |
| hpi                         | hpi-cumulus-mgtsw-firmware          : 2.0.0.0-20221103173754    |
|                             | hpicfg                              : 2.0.4.4-20221028182852b1  |
|                             | hpi-cumulus-fabspine-firmware       : 2.0.0.0-20221103173754    |
|                             | hpi-software                        : 1.0.8.0-20221123203813b13 |
|                             | hpi-lenovo-node-firmware            : 1.8.0.0-20221103173754    |
|                             | hpiutils                            : 2.0.4.4-20221028182843b1  |
|                             | hpi-x86_64-image                    : 2.0.4.5-20221102223218b7  |
|                             | hpi-cumulus-switch-firmware         : 2.0.0.0-20221103173754    |
|                             | hpi-dell-node-firmware              : 1.8.0.0-20221103173754    |
|                             | hpi-cumulus-fabsw-firmware          : 2.0.0.0-20221103173754    |
|                             | dct                                 : 1.0.7.8-20221103022304b7  |
| magneto                     | 1.0.28.1-20221121132020b30919                                   |
| mellanox                    | 1.0.8.0                                                         |
| mvcli                       | 2.3.10.1095                                                     |
| nodeos                      | 1.0.8.0-20221130073105b31174                                    |
| platformbackups             | 1.0.20.0-20221028152919b29972                                   |
| psklm                       | 1.0.21.0-20221122015956b5                                       |
| solarflare                  | 4.15.10.1002                                                    |
| supporttools                | 1.0.23.10-20221115174015b30691                                  |
+-----------------------------+-----------------------------------------------------------------+
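To check a single component without scanning the whole table, you can filter the same output. For example, to confirm the web console version:
ap version -s | grep -i cyclops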

Firmware post-upgrade steps

After you finish your 1.0.8 upgrade process, you must ensure that your firmware is also upgraded.


Procedure

  1. Run:
    /opt/ibm/appliance/platform/hpi/sys_hw_check node
    and verify the output. If it shows PASS or ABOVE for all nodes, no action is required.
    Note: NIC_ETH_EXT 1 and NIC_ETH_EXT 2 report BELOW in this check; this is expected. You cannot upgrade this firmware until a future software release. For example:
      NIC_ETH_EXT 1                                                         [BELOW]
        version              - 16.24.4020 / bundle ver - 16.27.2008          [INFO]
      NIC_ETH_EXT 2                                                         [BELOW]
        version              - 16.24.4020 / bundle ver - 16.27.2008          [INFO]
  2. If any of the nodes show BELOW, check the report log file to determine whether the node firmware must be upgraded or the BMC settings updated.
  3. If any of the nodes report BELOW for firmware, run:
    /opt/ibm/appliance/platform/hpi/sys_hw_config -f -t <target_nodes>
    For example:
    /opt/ibm/appliance/platform/hpi/sys_hw_config -f -t e1n{1..4}
  4. If any of the nodes report only incorrect BMC settings, run:
    /opt/ibm/appliance/platform/hpi/sys_hw_config -t <target_nodes>
    For example:
    /opt/ibm/appliance/platform/hpi/sys_hw_config -t e1n{1..4}
  5. After sys_hw_config completes and the nodes have rebooted, run:
    /opt/ibm/appliance/platform/hpi/sys_hw_check node
    to confirm that the updates took effect and that the output shows only PASS or ABOVE for all nodes (see the sketch that follows this procedure).
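To scan the check output quickly, you can filter it for components that are still below the bundled level. A minimal sketch; remember that NIC_ETH_EXT 1 and NIC_ETH_EXT 2 are expected to appear:
/opt/ibm/appliance/platform/hpi/sys_hw_check node | grep BELOW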

Node personality check post-upgrade steps

The 1.0.8 upgrade on Dell systems does not remove the WORKER personality. After the upgrade completes, you must check the nodes to ensure that no WORKER personalities remain. This issue occurs when the worker VM is gone from e4n1 but the node still has the WORKER personality set.

About this task

If you run virsh list --all on e4n1 and ap node -d and see output similar to the following:

[root@e4n1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
[root@nz5-node1 ~]#  ap node -d
+------------------+---------+---------------+-----------+-----------+--------+---------------+---------------+
| Node             |   State | Personality   | Monitored | Is Master | Is HUB | Is VDB Master | Is NRS Master |
+------------------+---------+---------------+-----------+-----------+--------+---------------+---------------+
| enclosure1.node1 | ENABLED | CONTROL,UNSET |       YES |       YES |    YES |            NO |            NO |
| enclosure2.node1 | ENABLED | CONTROL,UNSET |       YES |        NO |     NO |            NO |            NO |
| enclosure3.node1 | ENABLED | CONTROL,UNSET |       YES |        NO |     NO |            NO |            NO |
| enclosure4.node1 | ENABLED | WORKER,UNSET  |       YES |        NO |     NO |            NO |            NO |
+------------------+---------+---------------+-----------+-----------+--------+---------------+---------------+
Or if you see the following error in the tracelog:
2022-12-09 14:58:30 INFO: Checking for UNSET node(s)
                           LOGGING FROM: yosemite_bundle_upgrade.py:get_unset_worker_node_names:544
2022-12-09 14:58:30 TRACE: Running command [ap node | grep UNSET | cut -f2 -d '|'].

                           LOGGING FROM: yosemite_bundle_upgrade.py:get_unset_worker_node_names:544
2022-12-09 14:58:30 TRACE: RC: 0.
                           STDOUT: [ enclosure1.node1
                            enclosure2.node1
                            enclosure3.node1
                            enclosure4.node1
                           ]
                           STDERR: []

                           LOGGING FROM: yosemite_bundle_upgrade.py:get_unset_worker_node_names:544
2022-12-09 14:58:30 TRACE: Running command [ap node set_personality  UNSET --magneto_only -f].

                           LOGGING FROM: yosemite_bundle_upgrade.py:unset_worker_node_personalities:530
2022-12-09 14:58:31 TRACE: RC: 1.
                           STDOUT: [
                           Generated: 2022-12-09 14:58:31

                           ]
                           STDERR: ['UNSET' is not a valid node location
                           ]

                           LOGGING FROM: yosemite_bundle_upgrade.py:unset_worker_node_personalities:530
2022-12-09 14:58:31 ERROR: Error running command [ap node set_personality  UNSET --magneto_only -f].
You must apply the following workaround (a verification sketch follows the example below).
  • Set the node personalities after the 1.0.8 upgrade is complete. Depending on the existing node personality, run:
    • For CONTROL,WORKER run:
      ap node set_personality <node> CONTROL,UNSET --magneto_only -f
    • For WORKER,WORKER run:
      ap node set_personality <node> UNSET,UNSET --magneto_only -f
    • For WORKER,UNSET run:
      ap node set_personality <node> UNSET,UNSET --magneto_only -f
      For example:
      [root@gt25-node1 upgrade]# ap node
      +------------------+---------+----------------+-----------+-----------+
      | Node             |   State | Personality    | Monitored | Is Master |
      +------------------+---------+----------------+-----------+-----------+
      | enclosure1.node1 | ENABLED | CONTROL,WORKER |       YES |       YES |
      | enclosure2.node1 | ENABLED | CONTROL,WORKER |       YES |        NO |
      | enclosure3.node1 | ENABLED | CONTROL,WORKER |       YES |        NO |
      | enclosure4.node1 | ENABLED | UNSET,UNSET    |       YES |        NO |
      | enclosure5.node1 | ENABLED | UNSET,UNSET    |       YES |        NO |
      | enclosure6.node1 | ENABLED | UNSET,UNSET    |       YES |        NO |
      +------------------+---------+----------------+-----------+-----------+
      
      
      [root@gt25-node1 upgrade]# ap node set_personality  enclosure1.node1  CONTROL,UNSET --magneto_only -f
      Node role change request sent successfully
      
      Generated: 2022-12-13 11:08:07
      
      [root@gt25-node1 upgrade]# ap node
      +------------------+---------+----------------+-----------+-----------+
      | Node             |   State | Personality    | Monitored | Is Master |
      +------------------+---------+----------------+-----------+-----------+
      | enclosure1.node1 | ENABLED | CONTROL,UNSET  |       YES |       YES |
      | enclosure2.node1 | ENABLED | CONTROL,WORKER |       YES |        NO |
      | enclosure3.node1 | ENABLED | CONTROL,WORKER |       YES |        NO |
      | enclosure4.node1 | ENABLED | UNSET,UNSET    |       YES |        NO |
      | enclosure5.node1 | ENABLED | UNSET,UNSET    |       YES |        NO |
      | enclosure6.node1 | ENABLED | UNSET,UNSET    |       YES |        NO |
      +------------------+---------+----------------+-----------+-----------+
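After you set the personalities, you can confirm that no WORKER entries remain. A minimal check, reusing the ap node -d command shown earlier:
ap node -d | grep WORKER
An empty result means that the workaround is complete.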