2.0.2.1 Interim Fix 1 release notes

2.0.2.1 Interim Fix 1 delivers a connector node firmware upgrade and a new Cyclops package to support the OCP/OCS 4.12 upgrade.

What's new
  • RHEL is updated to version 8.8.
  • Cyclops package to support OCP/OCS 4.12. For more information, see Upgrading Cloud Pak for Data System.
  • With Cloud Pak for Data System 2.0.2.1 IF1, the boot mode of Lenovo connector nodes (SR650) is changed from Legacy to UEFI.
  • Connector nodes receive new firmware upgrades:
    • XCC
    • UEFI
    Additionally, Lenovo now supports new NVMe drives from Samsung and Solidigm.
Fixed issues
  • Fixed fabric switch memory leaks that caused the switches to run out of resources.
Known issues
Installing OCP fails during 2.0.2.1 IF1 upgrade due to MCP issue.
This problem occurs when there is a timeout while waiting for all the nodes to reach the Ready state, and it can potentially occur during any MCP update. The apupgrade.log file shows a FATAL_ERROR due to an MCP timeout, and running oc get nodes shows one or more nodes in the NotReady,SchedulingDisabled state, for example:
[root@gt01-node1 ~]# oc get nodes
NAME                STATUS                        ROLES    AGE   VERSION
e1n1-master.fbond   Ready                         master   38h   v1.19.0+d670f74
e1n2-master.fbond   Ready                         master   38h   v1.19.0+d670f74
e1n3-master.fbond   Ready                         master   38h   v1.19.0+d670f74
e1n4.fbond          Ready                         worker   37h   v1.19.0+d670f74
e2n1.fbond          NotReady,SchedulingDisabled   worker   37h   v1.19.0+d670f74
e2n2.fbond          Ready                         worker   37h   v1.19.0+d670f74
e2n3.fbond          Ready                         worker   37h   v1.19.0+d670f74
e2n4.fbond          Ready                         worker   37h   v1.19.0+d670f74
Also, crio.service is inactive, for example:
[root@gt01-node1 ~]# ssh core@e2n1
[core@e2n1 ~]$ sudo su
[root@e2n1 core]#  systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-env.conf, 20-nodenet.conf
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
[root@e2n1 core]#
At this point, the node is in the NotReady state, but pinging the node still works.
Workaround
  1. Run the following commands for each node that is in the NotReady state (a scripted sketch of the same sequence follows this list):
    ssh core@<node>.fbond
    sudo systemctl stop crio-wipe
    sudo systemctl stop crio
    sudo systemctl stop kubelet
    sudo chattr -iR /var/lib/containers/*
    sudo rm -rf /var/lib/containers/*
    sudo systemctl start crio-wipe
    sudo systemctl start crio
    sudo systemctl start kubelet
    exit
  2. Restart the upgrade.
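If more than one node is affected, the same sequence can be scripted. The following is a minimal sketch only, assuming SSH access as the core user and with the node name substituted for the affected node in your environment; it is not part of the official procedure:
    # Hypothetical helper: clean container storage on one NotReady node, then restart services
    NODE=e2n1.fbond   # replace with the affected node
    ssh core@${NODE} 'sudo systemctl stop crio-wipe crio kubelet'
    ssh core@${NODE} 'sudo chattr -iR /var/lib/containers/*'
    ssh core@${NODE} 'sudo rm -rf /var/lib/containers/*'
    ssh core@${NODE} 'sudo systemctl start crio-wipe crio kubelet'
After the services restart, confirm with oc get nodes that the node returns to the Ready state before you restart the upgrade.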
The "Unhealthy component detected" warning after upgrading to 2.0.2.1 IF1 from version 2.0.2.1
After upgrading to 2.0.2.1 IF1 from Cloud Pak for Data System 2.0.2.1, the ap issues command might return the following warnings:
[root@csr-node1 ~]# ap issues
Open alerts (issues) and unacknowledged events
+------+---------------------+--------------------+-------------------------------------------+-------------------------------+----------+--------------+
| ID   |          Date (EDT) |               Type | Reason Code and Title                     | Target                        | Severity | Acknowledged |
+------+---------------------+--------------------+-------------------------------------------+-------------------------------+----------+--------------+
| 1297 | 2023-10-20 15:17:39 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure7.node1.bmc1    |  WARNING |          N/A |
| 1298 | 2023-10-20 15:17:39 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure8.node1.bmc1    |  WARNING |          N/A |
| 1299 | 2023-10-20 15:19:36 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node3.bmc1    |  WARNING |          N/A |
| 1300 | 2023-10-20 15:19:36 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node2.bmc1    |  WARNING |          N/A |
| 1301 | 2023-10-20 15:19:50 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node3.bmc1    |  WARNING |          N/A |
| 1302 | 2023-10-20 15:19:50 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node4.bmc1    |  WARNING |          N/A |
| 1303 | 2023-10-20 15:19:50 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node4.bmc1    |  WARNING |          N/A |
| 1304 | 2023-10-20 15:20:05 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node2.bmc1    |  WARNING |          N/A |
| 1305 | 2023-10-20 15:20:05 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node3.bmc1    |  WARNING |          N/A |
| 1306 | 2023-10-20 15:20:05 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node1.bmc1    |  WARNING |          N/A |
| 1307 | 2023-10-20 15:20:20 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node4.bmc1    |  WARNING |          N/A |
| 1308 | 2023-10-20 15:20:20 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node4.bmc1    |  WARNING |          N/A |
| 1309 | 2023-10-20 15:20:20 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node2.bmc1    |  WARNING |          N/A |
| 1310 | 2023-10-20 15:20:34 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node1.bmc1    |  WARNING |          N/A |
| 1311 | 2023-10-20 15:21:11 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node1.bmc1    |  WARNING |          N/A |
| 1312 | 2023-10-20 15:21:25 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node1.bmc1    |  WARNING |          N/A |
| 1313 | 2023-10-20 15:32:22 | SW_NEEDS_ATTENTION | 495: Shared NPS config pool is not paused | sw://openshift.mcp/nps-shared |    MINOR |          N/A |
| 1318 | 2023-10-21 15:03:36 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node3.bmc1    |  WARNING |          N/A |
| 1319 | 2023-10-22 19:31:44 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node2.bmc1    |  WARNING |          N/A |
+------+---------------------+--------------------+-------------------------------------------+-------------------------------+----------+--------------+
Workaround
Run the following commands after the upgrade:
  1. oc get pods -n ap-magneto -o custom-columns=POD:.metadata.name,POD:.metadata.namespace --no-headers | awk '{print $1 " --namespace=" $2}'| xargs oc delete pod
    
  2. oc get pods -n ap-hardware-monitoring -o custom-columns=POD:.metadata.name,POD:.metadata.namespace --no-headers | awk '{print $1 " --namespace=" $2}'| xargs oc delete pod
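Both commands follow the same pattern: they delete every pod in the namespace so that the controllers re-create the pods and the stale alerts can clear. A minimal equivalent sketch, assuming the oc client is logged in with permission to delete pods in both namespaces:
  # Hypothetical loop over the two affected namespaces
  for ns in ap-magneto ap-hardware-monitoring; do
      oc delete pod --all -n "${ns}"
  done
After the pods are running again, rerun ap issues to check whether the HW_NEEDS_ATTENTION alerts have cleared.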
    
Mellanox version displayed as 5.8 in <major>.<minor> version format
The ap version -s and ap version -d commands display the Mellanox version in <major>.<minor> format, that is, as 5.8.
Upgrading to 2.0.2.1 IF1 might fail because e2n2 is not ready
The upgrade to 2.0.2.1 IF1 might fail with the error e2n2.fbond openshift node(s) found not Ready, as in the following output:
Performing operator upgrade pre-checks for component magneto
2023-11-02 10:59:08 INFO: magneto:Starting Magneto operator pre-install checks...
2023-11-02 10:59:08 INFO: Checking if pods are healthy in ap-magneto namespace
2023-11-02 10:59:08 INFO: All ap-magneto pods healthy
2023-11-02 10:59:08 INFO: magneto:Done executing Magneto operator pre-install checks.
2023-11-02 10:59:08 ATTENTION: Attempting self-healing fix for operation "check_oc_nodes_active"
2023-11-02 10:59:08 ERROR: Error running command [sudo rm -rf /var/lib/containers/*] on node core@['e2n2.fbond'].
2023-11-02 10:59:08 ERROR: sss_ssh_knownhostsproxy: Could not resolve hostname [e2n2.fbond]
2023-11-02 10:59:08 ERROR: kex_exchange_identification: Connection closed by remote host
2023-11-02 10:59:08 ERROR: ERROR: Error running command [sudo rm -rf /var/lib/containers/*] on node core@['e2n2.fbond'].
2023-11-02 10:59:08 ERROR: sss_ssh_knownhostsproxy: Could not resolve hostname [e2n2.fbond]
2023-11-02 10:59:08 ERROR: kex_exchange_identification: Connection closed by remote host
2023-11-02 10:59:08 ERROR:
2023-11-02 10:59:08 FATAL ERROR: Upgrade prerequisites not met. The system is not ready to attempt an upgrade.
2023-11-02 10:59:08 FATAL ERROR: check_oc_nodes_active : e2n2.fbond openshift node(s) found not Ready
2023-11-02 10:59:08 FATAL ERROR:
2023-11-02 10:59:08 FATAL ERROR: More Info: See trace messages at /var/log/appliance/apupgrade/20231102/apupgrade20231102081610.log.tracelog for additional troubleshooting information.
2023-11-02 10:59:08 logfile is /var/log/appliance/apupgrade/20231102/apupgrade20231102081610.log
2023-11-02 10:59:08 ========================================================

ATTENTION : Errors encountered during upgrade operation

========================================================

Logfile: /var/log/appliance/apupgrade/20231102/apupgrade20231102081610.log
Upgrade Command: apupgrade --upgrade --use-version 2.0.2.1.IF1_release --upgrade-directory /localrepo --phase platform

1. check_oc_nodes_active
        Upgrade Detail: Pre-install Checks
        Caller Info:The call was made from 'YosemiteBundleUpgradeChecker.check_oc_nodes_active' on line 372 with file located at '/localrepo/2.0.2.1.IF1_release/EXTRACT/platform/upgrade/modules/ibm/ca/checker/yosemite_bundleupgradechecker.py'
        Message: e2n2.fbond openshift node(s) found not Ready

========================================================
Workaround
  1. Identify the node or nodes that are disabled and caused the failure.
  2. Enable those nodes.
  3. Confirm that the nodes are online, and then restart the upgrade (example commands follow this list).
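The exact recovery depends on why the nodes were disabled, but standard OpenShift commands can be used to inspect a node and, if it is only cordoned, re-enable it. A minimal sketch, assuming e2n2.fbond is the affected node:
  # Check which nodes are NotReady or SchedulingDisabled
  oc get nodes
  # If the node is only cordoned (SchedulingDisabled), uncordon it
  oc adm uncordon e2n2.fbond
  # Confirm that the node reports Ready before restarting the upgrade
  oc get node e2n2.fbond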
The ap version -s command might not display the upgrade information for Cloud Pak for Data System version 2.0.2.1 IF1, and the upgrade fails on the component_version_cronjob.py component with the following error in the tracelog:
[root@e1n1-QVR component_version_cronjob]# /usr/bin/python3 /opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py
Traceback (most recent call last):
  File "/opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py", line 277, in <module>
    software_version = versions["appliancesoftwareversion"][first_node].split("-", 1)[0]
KeyError: 'e10n1'
Workaround
  1. Run the following command to hardcode e1n1 as the first node in the script:
    sed -i -e 's|first_node = get_nodes()\[0\]|first_node = "e1n1"|g' /opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py
  2. Rerun the script component_version_cronjob.py with the following command:
    python3 /opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py
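The traceback suggests that get_nodes()[0] returned e10n1 and that this key is missing from the version data, which is consistent with plain string ordering of node names on larger systems (e10n1 sorts before e1n1); the sed command simply pins the first node to e1n1. The following is an illustrative check only, not part of the workaround:
  # Hypothetical illustration: 'e10n1' sorts before 'e1n1' in plain string order
  python3 -c 'print(sorted(["e1n1", "e2n1", "e10n1"]))'
  # ['e10n1', 'e1n1', 'e2n1']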

Before you begin

  • Download the following package 2.0.2.1.IF1-WS-ICPDS-fpXXX, where XXX stands for the latest package number, from Fix Central.
  • The system must be on version 2.0.2.1 before you apply the interim fix.
  • If NPS® has non-default admin account credentials, complete the following actions before you upgrade:
    1. Ensure that you have the NPS database admin user password.
    2. In the /home/nz/.bashrc file inside the container, set NZ_USER=admin and NZ_PASSWORD=<customer_password> (an example follows this list).
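    For example, the variables can be set by appending them to the nz user's .bashrc inside the container. This is a minimal sketch; the password value is a placeholder for your own admin password:
      # Hypothetical example: run inside the NPS container as the nz user
      echo 'export NZ_USER=admin' >> /home/nz/.bashrc
      echo 'export NZ_PASSWORD=<customer_password>' >> /home/nz/.bashrc
      source /home/nz/.bashrc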
Table 1. Upgrade time
Dell/Lenovo          Upgrade time
Lenovo Base + 8      7 hours (NPS, firmware downgrade)
Lenovo Base          5 hours (firmware downgrade)

Procedure

  1. Connect to node e1n1 by using the management address, not the application address or the floating address.
  2. Verify that e1n1 is the hub:
    1. Check for the hub node by verifying that the dhcpd service is running:
      systemctl is-active dhcpd
    2. If the dhcpd service is running on a node other than e1n1, bring the service down on that other node:
      systemctl stop dhcpd
    3. On e1n1, run:
      systemctl start dhcpd
  3. Download the icpds-release-2.0.2.1.IF1.tar.gz bundle and copy it to /localrepo on e1n1.
    Note: Make sure that you delete all bundle files from previous releases.
  4. From the /localrepo directory on e1n1, run:
    mkdir /localrepo/2.0.2.1.IF1_release

    and move the system bundle into that directory. The directory that is used here must be uniquely named; that is, no previous upgrade on the system can have been run from a directory with the same name.

  5. Verify the status of your appliance by running:
    • ap issues
    • ap version -s
    • ap sw
  6. Optional: Run upgrade details to view details about the specific upgrade version:
    apupgrade --upgrade-details --upgrade-directory /localrepo --use-version 2.0.2.1.IF1_release --phase platform
  7. Run preliminary checks before you start the upgrade process. The preliminary check option scans for possible issues and attempts to fix any known issues automatically during the pre-checks.
    apupgrade --preliminary-check-with-fixes --upgrade-directory /localrepo --use-version 2.0.2.1.IF1_release --phase platform
  8. Start the upgrade process:
    apupgrade --upgrade --upgrade-directory /localrepo --use-version 2.0.2.1.IF1_release --phase platform
  9. Wait for the upgrade to complete successfully.
  10. Run:
    ap version -s

    and verify that the interim fix is listed under Interim Fixes.

    Note: If you want to upgrade NPS after upgrading to 2.0.2.1.IF1, the oc client must be logged in as an apadmin user before you run the nzinstall command.
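    For example, before running nzinstall you can confirm which user the oc client is logged in as; the exact login command depends on your apadmin credentials and is shown here only as a sketch:
      # Check the currently logged-in OpenShift user; it should report apadmin
      oc whoami
      # Hypothetical: if a different user is reported, log in as apadmin first
      oc login -u apadmin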