Version 2.0.2.1 release notes

Cloud Pak for Data System version 2.0.2.1 improves the upgrade process.

For more information about Cloud Pak for Data versions that are supported by Cloud Pak for Data System 2.0.2.1, see Upgrading Cloud Pak for Data System.
Note: To upgrade Cloud Pak for Data, contact the Cloud Pak for Data team.

Upgrading

If you have an existing installation of Cloud Pak for Data System 1.x, contact IBM Support to plan the upgrade. For more information on the process, see Advanced upgrade from versions 1.0.x.
Note: If the system has any self-managed Cloud Pak for Data tenants (in a namespace other than zen or ap-console), those tenants are upgraded automatically during the service bundle upgrade, unless you pin the installation to a specific version in the ZenService custom resource. For more information, see the Manual upgrade section in Choosing an upgrade plan for the Cloud Pak for Data control plane.
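If you need to check or pin a tenant's version yourself, a minimal sketch follows. It assumes that the tenant's ZenService custom resource is named lite-cr, as in the examples later in these notes; <tenant-namespace> and <pinned-version> are placeholders that you must replace.
    # Check the current version of the tenant's ZenService custom resource
    oc get zenservice lite-cr -n <tenant-namespace> -o jsonpath='{.spec.version}'
    # Pin the tenant to a specific version so that the bundle upgrade does not move it
    oc patch zenservice lite-cr -n <tenant-namespace> --type=merge --patch '{"spec": {"version":"<pinned-version>"}}'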

The end-to-end upgrade time from 2.0.1.1 to 2.0.2.1 ranges from 24 to 30 hours. This includes the system, firmware, and OCP/OCS component upgrades.

Software components

Fixed issues

  • Fixed the issue of OCS reporting unhealthy due to mon low on available space during the upgrade from 2.0.1.0 to 2.0.2.0.
  • Fixed the issue of OCS reporting unhealthy due to mon with slow ops during the upgrade from 2.0.1.0 to 2.0.2.0.

Known issues

Platform phase upgrade known issues
Nodes reboot at the preliminary-checks step during the 2.0.2.1 upgrade
check_ipmitool_lan is one of the preliminary checks during the 2.0.2.1 upgrade. When this check fails, you can see the following messages in the upgrade log:
LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:81
2022-02-28 14:21:48 ERROR: Error running command [ifup fbond] on [u'e1n1']
                           LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:81
2022-02-28 14:21:48 ERROR: nodeos:NodeosUpgrader.prechecker.fixer:Unable to fix problem with ipmitool lan. Fatal Problem: Failed to start GPFS.
                           LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:85
2022-02-28 14:21:48 TRACE: In method logger.py:log_error:142 from parent method node_os_prechecker_fixer.py:fix_check_ipmitool_lan:85 with args
                               msg = nodeos:NodeosUpgrader.prechecker.fixer:Unable to fix problem with ipmitool lan. Fatal Problem: Failed to start GPFS.


                           LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:85
2022-02-28 14:21:48 Upgrade prerequisites not met. The system is not ready to attempt an upgrade.
check_ipmitool_lan : nodeos:NodeosUpgrader.prechecker:Problem encountered during check for problem with lan interface. A reboot of the following nodes is suggested [u'e1n1', u'e2n1', u'e3n1']
                           LOGGING FROM: bundle_upgrade.py:report_any_failed_operation_results:1124
Workaround:
  1. Check the uptime of all the nodes that require a reboot (see the sketch after this procedure). If the uptime is less than five minutes, the node was already rebooted automatically.
  2. If the uptime is greater than five minutes, reboot the node manually.
  3. After the required nodes reboot, restart the upgrade. For example:
    apupgrade --preliminary-check-with-fixes --use-version 2.0.2.1 --upgrade-directory /localrepo --phase platform
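The following sketch shows one way to perform steps 1 and 2; the node names come from the example log above, so replace them with the nodes that your log lists.
    # Check the uptime of each node that the log flags for a reboot
    for node in e1n1 e2n1 e3n1; do ssh $node uptime; done
    # If a node has been up for more than five minutes, reboot it manually, for example:
    ssh e2n1 reboot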
MCP update fails during 2.0.2.1 upgrade
During the --phase platform upgrade, the MCP update fails with the following message:
INFO: Function: check_update_completion failed after 90 tries : Nodes in worker pool are updating
INFO: AbstractUpgrader.postinstaller:MCP update failed, attempting self-recovery
ERROR: AbstractUpgrader.postinstaller:Unable to perform self-recovery, MCP update failed: AbstractUpgrader.postinstaller:Unable to perform self-recovery, the cause of error is unknown
INFO: mcp:Done executing MCP post-install steps.
FATAL ERROR: Errors encountered
FATAL ERROR:
FATAL ERROR: McpUpgrader.postinstall : ERROR: AbstractUpgrader.postinstaller:Unable to perform self-recovery, MCP update failed: AbstractUpgrader.postinstaller:Unable to perform self-recovery, the cause of error is unknown
It can also fail after --phase platform completes, with the following message:
"'oc describe node | grep state' would mention unable to drain the pod zen-metastoredb-xx"
The problem is with the zen-metastoredb pod eviction.
Workaround:
  1. Run:
    oc delete pdb zen-metastoredb-budget -n zen
  2. Run:
    oc delete pdb zen-metastoredb-budget -n ap-console
    so that the node can drain the pod and continue with the MCP update.
  3. Run:
    ps fx | grep apupgrade
    to determine whether the upgrade is still running after both oc delete pdb commands. If it is running, the upgrade continues.
  4. If the upgrade stopped running, restart it (see the sketch after this procedure).
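A minimal sketch of steps 3 and 4 follows. The apupgrade invocation is only an example, taken from the platform-phase command shown earlier in these notes; use the same invocation that you originally started the upgrade with.
    # Check whether apupgrade is still running after both oc delete pdb commands
    ps fx | grep apupgrade
    # If it is no longer running, restart the upgrade, for example:
    apupgrade --preliminary-check-with-fixes --use-version 2.0.2.1 --upgrade-directory /localrepo --phase platform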
2.0.2.1 platform upgrade fails after GPFS reboot because a node is unreachable
When you run into this issue, the upgrade stops GPFS and checks whether the nodes are available. The check reports that one of the nodes is not reachable, and you see the following error:
INFO: Some nodes were not available.
 ['eXnY']
ERROR: Unable to powercycle nodes via ipmitool.
ERROR: 'bmc_addr'
ERROR: The following nodes are still unavailable after a reboot attempt: ['eXnY']
FATAL ERROR: Problem rebooting nodes
Then, the upgrade stops.
Workaround:
  1. Reboot the unreachable node (see the sketch after this procedure for one way to do it through the BMC).
  2. Mount the filesystems. Run:
    mmmount all -a
  3. Run:
    mmlsmount all -L
    For example:
    [root@gt03-node1 ~]# mmlsmount all -L
    
    File system ips is mounted on 3 nodes:
      9.0.32.16       e1n1
      9.0.32.17       e1n2
      9.0.32.18       e1n3
    
    File system platform is mounted on 3 nodes:
      9.0.32.18       e1n3
      9.0.32.16       e1n1
      9.0.32.17       e1n2
  4. Restart the upgrade.
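If the node cannot be reached over ssh, one way to reboot it is to power cycle it through its BMC, as in the following sketch. The BMC address and credentials are placeholders; use the values for your system.
    # Hypothetical example: power cycle the unreachable node through its BMC
    ipmitool -I lanplus -H <bmc-address-of-eXnY> -U <bmc-user> -P <bmc-password> chassis power cycle
    # When the node is back, remount the file systems as shown in step 2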
2.0.2.1 upgrade fails during platform phase due to unstable MCP state
This failure happens during --phase platform due to the following error:
FATAL ERROR: MCP is not in a stable state. Please resolve this issue before attempting an upgrade.
You can find more information in the corresponding log, for example: /var/log/appliance/apupgrade/20220630/apupgrade20220630082336.log
Workaround:
  1. Check the nodes and look for those that are NotReady by running:
    oc get nodes
    For example:
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS     ROLES    AGE   VERSION
    e1n1-master.fbond   Ready      master   38h   v1.19.0+d670f74
    e1n2-master.fbond   Ready      master   38h   v1.19.0+d670f74
    e1n3-master.fbond   NotReady   master   38h   v1.19.0+d670f74
    e1n4.fbond          Ready      worker   38h   v1.19.0+d670f74
    e2n1.fbond          Ready      worker   38h   v1.19.0+d670f74
    e2n2.fbond          Ready      worker   38h   v1.19.0+d670f74
    e2n3.fbond          NotReady   worker   38h   v1.19.0+d670f74
    e2n4.fbond          Ready      worker   38h   v1.19.0+d670f74
  2. Run:
    oc get mcp
    to check the MCP status. For example:
    
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-f17afa51f05e6aed60f2eb1931db6c63   False     True       False      3              2                   3                     0                      38h
    unset    rendered-unset-3e6f7f0f84f4a0796786b102c7679233    False     True       False      2              1                   2                     0                      38h
    worker   rendered-worker-3e6f7f0f84f4a0796786b102c7679233   True      False      False      3              3                   3                     0                      38h
  3. Verify the CSR certificates for the nodes that show True in the DEGRADED or UPDATING columns.
  4. Run:
    oc get csr
    to check whether there are any pending certificates.
  5. Run:
    oc get csr -o name | xargs oc adm certificate approve
    to approve all certificates.
  6. Run:
    oc get mcp
    oc get nodes
    to confirm that all nodes are Ready and all MCPs are UPDATED. For example:
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS     ROLES    AGE   VERSION
    e1n1-master.fbond   Ready      master   38h   v1.19.0+d670f74
    e1n2-master.fbond   Ready      master   38h   v1.19.0+d670f74
    e1n3-master.fbond   Ready      master   38h   v1.19.0+d670f74
    e1n4.fbond          Ready      worker   38h   v1.19.0+d670f74
    e2n1.fbond          Ready      worker   38h   v1.19.0+d670f74
    e2n2.fbond          Ready      worker   38h   v1.19.0+d670f74
    e2n3.fbond          Ready      worker   38h   v1.19.0+d670f74
    e2n4.fbond          Ready      worker   38h   v1.19.0+d670f74
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-f17afa51f05e6aed60f2eb1931db6c63   True      False      False      3              3                   3                     0                      38h
    unset    rendered-unset-3e6f7f0f84f4a0796786b102c7679233    True      False      False      3              3                   3                     0                      38h
    worker   rendered-worker-3e6f7f0f84f4a0796786b102c7679233   True      False      False      3              3                   3                     0                      38h
  7. Restart the upgrade.
2.0.2.1 upgrade fails during the OpenShift phase
This failure happens after apupgrade starts the system, brings it to the Ready state, and then runs the oc login command. apupgrade terminates with a timeout even though running oc login manually works, and the same oc login command succeeds after you restart apupgrade. If you run into this issue, you can see the following errors in apupgrade.log:
  • ERROR: STDERR: [error: Missing or incomplete configuration info. Please point to an existing, complete config file:
  • FATAL ERROR: McpUpgrader.preinstall : Failed to login to openshift server : error: The server uses a certificate signed by unknown authority. You may need to use the --certificate-authority flag to provide the path to a certificate file for the certificate authority, or --insecure-skip-tls-verify to bypass the certificate check and use insecure connections.
  • FATAL ERROR: McpUpgrader.preinstall : Openshift login timed out after waiting for 30 minutes. Please run following command manually and if successful resume upgrade, or contact IBM Support for help
Workaround:
  1. Wait for apupgrade to time out.
  2. Run the oc login command that is written in the log. For example:
    oc login --token=$(cat /root/.sa/token) https://api.localcluster.fbond:6443
  3. Restart the upgrade.
Platform and OCP phase upgrade known issues
Installing OCP fails during 2.0.2.1 upgrade due to MCP issue
This problem occurs when there is a timeout waiting for all the nodes to be in Ready state. It can potentially occur during any MCP update. You can see a FATAL_ERROR in apupgrade.log due to an MCP timeout. You can also see, after running oc get nodes, that a node is in the NotReady,SchedulingDisabled state. For example:
[root@gt01-node1 ~]# oc get nodes
NAME                STATUS                        ROLES    AGE   VERSION
e1n1-master.fbond   Ready                         master   38h   v1.19.0+d670f74
e1n2-master.fbond   Ready                         master   38h   v1.19.0+d670f74
e1n3-master.fbond   Ready                         master   38h   v1.19.0+d670f74
e1n4.fbond          Ready                         worker   37h   v1.19.0+d670f74
e2n1.fbond          NotReady,SchedulingDisabled   worker   37h   v1.19.0+d670f74
e2n2.fbond          Ready                         worker   37h   v1.19.0+d670f74
e2n3.fbond          Ready                         worker   37h   v1.19.0+d670f74
e2n4.fbond          Ready                         worker   37h   v1.19.0+d670f74
Additionally, crio.service is inactive, for example:
[root@gt01-node1 ~]# ssh core@e2n1
[core@e2n1 ~]$ sudo su
[root@e2n1 core]#  systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-env.conf, 20-nodenet.conf
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
[root@e2n1 core]#
At this point, the node is in NotReady state even though ping to the node still works.
Workaround:
  1. Run the following commands for the node in the NotReady state:
          ssh core@<node>.fbond
          sudo systemctl stop crio-wipe
          sudo systemctl stop crio
          sudo systemctl stop kubelet
          sudo chattr -iR /var/lib/containers/*
          sudo rm -rf /var/lib/containers/*
          sudo systemctl start crio-wipe
          sudo systemctl start crio
          sudo systemctl start kubelet
          exit
  2. Restart the upgrade.
OCP phase upgrade known issues
2.0.2.1 upgrade fails due to OCP upgrade failure
This problem happens when the OCP component upgrade fails because image-registry is degraded. You can see the following output in the tracelog:
                           LOGGING FROM: ocp_installer.py:run_ocp_upgrade_script:77
2022-07-12 11:10:32 TRACE: RC: 1.
                           STDOUT: [2022-07-12 11:07:31.326473 INFO: Targeted OCP version is 4.7.51.
                           
                           2022-07-12 11:07:31.326647 INFO: Starting OCP upgrade ...
                           
                           2022-07-12 11:07:31.326702 INFO: Validating program parameters...
                           
                           2022-07-12 11:07:31.326921 INFO: Validating all the nodes are in Ready state ...
                           
                           2022-07-12 11:07:31.490570 INFO: Validating the system state...
                           
                           2022-07-12 11:10:32.119651 ERROR: Failed to validate the cluster operators.
                           
                           2022-07-12 11:10:32.119708 ERROR: Failed to verify the system state.
                           
                           2022-07-12 11:10:32.119764 INFO: OCP upgrade to version 4.7.51 failed!!.
                           
                           ]
To determine whether you ran into the issue, follow these steps:
  1. Run:
    oc get co | grep  image-registry
    to confirm the image-registry shows the following string:
    image-registry                             4.6.32    True        False         True       42h
  2. Run:
    oc describe co image-registry
    to confirm that the output shows the following message:
        Message:               ImagePrunerDegraded: Job has reached the specified backoff limit
        Reason:                ImagePrunerJobFailed
        Status:                True
        Type:                  Degraded
Workaround:
  1. Check the managementState by running:
    oc get config cluster -o yaml 
  2. If the managementState is Removed, change it to Managed and save the resource. Run:
    oc edit config cluster
    and set:
    managementState: Managed
    A non-interactive alternative is shown in the sketch after this procedure.
  3. If the issue persists and step 2 did not solve it, run:
    oc patch imagepruner.imageregistry/cluster --patch '{"spec":{"suspend":true}}' --type=merge
    oc -n openshift-image-registry delete jobs --all
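As a non-interactive alternative to oc edit in step 2, you can patch the image registry operator configuration directly. This is a sketch that assumes the config resource shown above resolves to configs.imageregistry.operator.openshift.io/cluster.
    # Set managementState to Managed without opening an editor
    oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"managementState":"Managed"}}'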
Ceph is in HEALTH_WARN state due to 1 daemons have recently crashed during 2.0.2.1 upgrade
You might see a warning in Red Hat OpenShift Container Storage (OCS) in the underlying Ceph cluster because one or more daemons recently crashed. The 2.0.2.1 upgrade includes built-in checks for the health of the Ceph cluster, and these checks flag this warning and stop the operation.

Symptoms:

You can see the following error in the log:
2022-08-02 10:20:13 INFO: OcsUpgrader.postinstall: Finished post upgrade checks for OCS upgrade
2022-08-02 10:20:13 INFO: ocs:Done executing OCS post-install checks.
2022-08-02 10:20:13 FATAL ERROR: Errors encountered
2022-08-02 10:20:13 FATAL ERROR:
2022-08-02 10:20:13 FATAL ERROR: check_ocs_state : OcsUpgrader.postinstall:Error encountered during OCS cluster health check!
2022-08-02 10:20:13 FATAL ERROR: This error requires manual intervention to resolve. Please contact IBM Support.
To confirm that you hit the issue, run:
[root@e1n1 ~]# TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
[root@e1n1 ~]# oc rsh -n openshift-storage $TOOLS_POD ceph status | grep daemon | grep crashed
and check whether the second command returned anything on stdout. If it did not, you have run into some other issue and you must contact IBM Support.
Workaround:
  1. Run:
    [root@e1n1 ~]# NAMESPACE=openshift-storage
    [root@e1n1 ~]# ROOK_OPERATOR_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')
    [root@e1n1 ~]# oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} -- ceph crash archive-all --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
  2. Run:
    [root@e1n1 ~]# oc rsh -n openshift-storage $TOOLS_POD ceph status | grep daemon | grep crashed
    to ensure that there is no output on stdout now.
    Note: If you still see output on stdout, wait for another 30 seconds and try again. If the stdout output persists, contact IBM Support.
aposDnsCheck fails if there is no wildcard, which breaks the 2.0.2.1 upgrade
apupgrade calls DnsCheck while the cluster is down, so the oc get route command does not work. For reference, this is the relevant check code:
    random_host = ''.join(random.choice(string.digits + string.ascii_letters) for i in range(32))
    if query_dns(random_host + "." + house_yml['all']['vars']['app_fqdn']) != 0:
        [rc, stdout, stderr] = cmd_runner.run_shell_cmd("oc get routes --all-namespaces | egrep -v localcluster")
        if rc == 0:
            logger.log_info("[RUNNING] - The oc command listed customer defined service specific entries.")
        else:
            logger.log_print("[FAIL] - Could not run oc get route command. Please make sure you have permissions and access to the oc command and try again.")
            issue_found = 1
DnsCheck passes only if you have a wildcard DNS entry. If you do not have a wildcard, this check fails and breaks the 2.0.2.1 upgrade.
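To check the wildcard record manually, outside of aposDnsCheck, you can query a random host name under the application FQDN, as the check itself does. This is a sketch; <app_fqdn> stands for the app_fqdn value from your house yaml (Customer.yml).
    # A wildcard DNS entry resolves any host name under the application FQDN
    dig +short randomhost123.<app_fqdn>
    # Output: an IP address if the wildcard exists; no output if the wildcard record is missing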
To determine whether you ran into the issue, follow these steps:
  1. Run:
    oc get route
    and check whether it fails.
  2. Run:
    /opt/ibm/appliance/platform/apos-comms/tools/aposDnsCheck.py
    and check whether the base FQDN is resolvable but wildcards are not. For example:
    Validating /opt/ibm/appliance/platform/apos-comms/customer_network_config/ansible/Customer.yml
    Checking hostname
    [RUNNING] - Trying to query base FQDN
    [PASS] - The base FQDN was resolvable
    [RUNNING] - Trying to query a wildcard entry
    [ERROR] DNS server  <x.y.z.a> appears reachable, but is missing DNS record <xyz>. If you need a different suffix, add a DNS search string or specific explicitly
    [WARN] If you do not specify a wildcard entry, you must specify a DNS entry for each service that is deployed
    [RUNNING] - Trying to query customer defined service specific entries
    [ERROR] Could not run oc get route command. Please make sure you have permissions and access to the oc command and try again.
    
    [ERROR] Issue(s) issues above must be corrected on upstream DNS
  3. Apply the following workaround:
    1. Manually edit the /localrepo/2.0.1.1/EXTRACT/platform/upgrade/aposcomms/aposcomms_prechecker.py file by commenting out line 233:
      Results.append(self.run_dns_check_util())
    2. Edit /opt/ibm/appliance/apupgrade/bin/apupgrade file by commenting out line 837:
      self.verify_bundle(self.working_dir)
  4. If you run step 2 but get the following output:
    [RUNNING] - Trying to query base FQDN
    [ERROR] Error trying to query base FQDN <customer-fqdn>, please make sure the base FQDN indicated in the house yaml can be resolved
    then you do not have proper base FQDN DNS forwarding records. Stop the upgrade and fix your base FQDN records.
Storage overfilled warning is seen during the 2.0.2.1 upgrade
During the 2.0.2.1 upgrade, /localrepo/2.0.2.1_release/ becomes overfilled and you can see the following warning:
[root@e1n1 ~]# ap issues
Open alerts (issues) and unacknowledged events
+------+---------------------+---------------------+------------------------------------------+-------------------------------+----------+--------------+
| ID   |          Date (UTC) |                Type | Reason Code and Title                    | Target                        | Severity | Acknowledged |
+------+---------------------+---------------------+------------------------------------------+-------------------------------+----------+--------------+
| 1024 | 2022-09-15 22:41:07 | STORAGE_UTILIZATION | 901: Storage utilization above threshold | sw://fs.sda2/enclosure1.node1 |  WARNING |          N/A |
+------+---------------------+---------------------+------------------------------------------+-------------------------------+----------+--------------+
Workaround:
  • Run:
    rm -rf /localrepo/2.0.2.1_release/
    to remove the release directory you created at the start of the upgrade.
Post-upgrade known issues
An unexpected upgrade might be triggered if your zen operand version is not pinned to 4.2.0 (Cloud Pak for Data 4.0.2 level) in the ap-console namespace
To prevent this, existing 2.x Cloud Pak for Data System customers must pin the zen operand version to 4.2.0 (Cloud Pak for Data 4.0.2 level) in the ap-console namespace. Upgrading Cloud Pak for Data to a higher version should be done only in the zen namespace.
Workaround:
  1. Run:
    oc get zenservice lite-cr -n ap-console -o json | jq .spec.version
    to determine the current version.
  2. Depending on the version that is reported, perform one of the following actions:
    • If you get a version above 4.2.0, contact IBM Cloud Pak for Data Support.
    • If you get 4.2.0 as the current version, no action is required.
    • If you get a version below 4.2.0, run:
      oc patch zenservice lite-cr  --namespace ap-console  --type=merge --patch '{"spec": {"version":"4.2.0"}}'
Enabling FIPS fails after the 2.0.2.1 upgrade completes
If you upgraded to version 2.0.2.1 and try to re-enable FIPS, the command fails with the following error: dracut: installkernel failed in module kernel-modules-extra. This happens because an old kernel-modules-extra RPM remains on your system along with the new kernel RPM after the upgrade. To re-enable FIPS, you must first remove the old RPM from all control nodes after the upgrade. Contact IBM Support for assistance. After this is complete, you can re-enable FIPS. For more information, see Configuring FIPS on 2.0.2 Cloud Pak for Data System.
Workaround:
  • Remove the old kernel-modules-extra* RPM that corresponds to the old kernel version (see the sketch after this list to identify it). For example, run:
    # yum remove kernel-modules-extra-4.18.0-193.13.2.el8_2.x86_64
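To identify which kernel-modules-extra RPM is the old one, a minimal sketch:
    # List the installed kernel-modules-extra packages and the currently running kernel
    rpm -qa 'kernel-modules-extra*' | sort
    uname -r
    # Remove the package whose version does not match the running kernel, on every control node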

Other known issues

Lenovo SD530 data collection for enclosures fails to collect enclosure Fast Failure Data Collection (FFDC) from the System Management Module (SMM)
To avoid that issue, use sys_hw_util. Depending on your requirements, you can:
  • Collect one SMM FFDC from one enclosure, for example enclosure2, by running:
    /opt/ibm/appliance/platform/hpi/sys_hw_util ffdc -t e2n1 --smm
  • Collect multiple SMM FFDC from all enclosures in a BASE+2 (four enclosures) system by running:
    /opt/ibm/appliance/platform/hpi/sys_hw_util ffdc -t e{1..4}n1 --smm
    After you run this command on the console, you get the file location of the collected data. For example:
      Comp   |               FFDC Log Location               |
    ---------+-----------------------------------------------+
     e2smm   | /var/log/appliance/platform/hpi/FFDC_e2smm/   |
Some BMC/SMM configuration settings on CPDS can change unexpectedly
Several Cloud Pak for Data System issues were seen recently whose root cause was related to one or more BMC configuration settings having been changed from their expected values. You can see several symptoms, depending on which settings were changed. One of the cases observed recently on several different systems resulted in one or more NPS SPUs not being able to PXE boot. This might result in Netezza Performance Server going into a DOWN state, depending on how many SPUs are affected.
To determine whether there are BMC settings that were modified from those values that are expected on the Cloud Pak for Data System platform, run:
/opt/ibm/appliance/platform/hpi/sys_hw_check node
If any nodes have any of these BMC settings incorrectly configured, the screen output from sys_hw_check indicates that the checks for those nodes failed. The report from sys_hw_check itemizes exactly which settings generated an ERROR.
Note: When some of these settings are corrected, the corrected nodes automatically reboot for the changes to take effect. The specific settings that cannot be changed without requiring a reboot are:
  • DevicesandIOPorts.OnboardLANPort1 and 2
  • EnableDisableAdapterOptionROMSupport.OnboardLANPort1 and 2
  • MellanoxNetworkAdapter--Slot6 PhysicalPort1 LogicalPort1.NetworkLinkType and PhysicalPort2 (Connector Node only)
  • DevicesandIOPorts.Device_Slot2 and 4 (Connector Node only)

Before you run sys_hw_config to correct any of these (or other) BMC settings, check the report log from sys_hw_check. If any of the four BMC settings mentioned above are flagged as incorrect, resetting them to the expected values requires the affected nodes to be rebooted. Ensure that these node reboots are acceptable before proceeding.

Workaround:
  1. Run:
    /opt/ibm/appliance/platform/hpi/sys_hw_config node
    to fix all of the incorrectly configured settings.
  2. Rerun:
    sys_hw_check
    to verify that the settings are correct.
  3. If NPS went down due to this problem, perform the following actions:
    1. Run:
      nzstop
    2. Run:
      nzstart
      to bring NPS back online.
chronyd does not support poor-quality Windows NTP servers
Poor-quality time sources might be automatically rejected by Cloud Pak for Data System. Provide production-quality time sources to avoid this issue.
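To inspect which time sources chronyd accepted and how they rate, a minimal sketch using standard chrony commands, run on a control node:
    # Show the configured time sources and their selection state
    chronyc sources -v
    # Show measured drift and jitter statistics for each source
    chronyc sourcestats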
Invalid python symbolic link is set during the 2.0.2.1 upgrade, which breaks ansible
This problem occurs during the 2.0.2.1 upgrade: the ansible command fails with the following error:
[ERROR] Failed running ansible command: [DEPRECATION WARNING]: Distribution redhat 8.6 on host node1 should use
[ERROR] /usr/libexec/platform-python, but is using /usr/bin/python for backward
Workaround:
  1. Verify the symbolic link on e1n1 by running:
    [root@e1n1 ansible]# ls -lah /usr/bin/python
    lrwxrwxrwx. 1 root root 36 Nov  2  2021 /usr/bin/python -> /etc/alternatives/unversioned-python
    You can see that the link is not set on e1n2/e1n3, for example:
    [root@e1n1 ansible]# ssh e1n2 ls -lah /usr/bin/python
    ls: cannot access '/usr/bin/python': No such file or directory
  2. Remove the invalid symbolic link on e1n1 (you can verify the result as shown after this procedure). Run:
    [root@e1n1 ansible]# rm /usr/bin/python
      rm: remove symbolic link '/usr/bin/python'? y
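    To verify the result, a small sketch; after the fix, /usr/bin/python should not resolve on any of the control nodes, matching e1n2 and e1n3:
    # Expect 'No such file or directory' from every control node
    for node in e1n1 e1n2 e1n3; do ssh $node 'ls -lah /usr/bin/python'; done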
    
OCP node fbond advertising with wrong MAC address
This problem can happen during the upgrade from 2.0.0 or 2.0.1 to 2.0.2.x. You might see an error similar to the following:
x509: certificate is valid for 9.0.32.46, not 9.0.59.79
This might mean that the MAC address that is being advertised for fbond on the node is incorrect. In the example error, the correct IP is the 9.0.32.X address, but the 9.0.59.Y address is in the dynamic range for devices that are not registered in DHCP (that is, 9.0.48.1-9.0.63.235), which might mean that the node has an improper MAC address and the DHCP server has no record of it. Set the MAC address on the node back to the MAC address that is listed in fab_1 of the SJC.
Workaround:
  1. Run:
    oc get nodes -o wide
    and find the node whose IP address matches the second IP from the error message, which is 9.0.59.79 (eXnY) in the example.
  2. Find the correct MAC address for the node by looking in the SJC, in the "fab_1" section of the component/subcomponent entry for the eXnY that you identified in step 1. For example:
       "fab_1": {
                            "mac_addr": "1c:34:da:48:c4:64",
                            "sw_alias": "",
                            "sw_port": ""
                        },
  3. Run the following commands:
    ssh core@eXnY.fbond   (use the node name from step 1)
    MAC=<MAC address from step 2>
    sudo nmcli con mod fbond 802-3-ethernet.mac-address $MAC
    sudo nmcli con mod fbond 802-3-ethernet.cloned-mac-address $MAC
    sudo nmcli con down fbond
    sudo nmcli con up fbond
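After fbond comes back up, you can verify the change with a sketch like the following; the nmcli field name is the same property that was set in step 3.
    # Confirm that fbond now reports the MAC address from the SJC
    nmcli -g 802-3-ethernet.cloned-mac-address con show fbond
    ip link show fbond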
Reconfiguring OCS storage nodes is required when upgrading from version 2.0.0 or 2.0.1.0 to 2.0.2 or greater
If you are running Cloud Pak for Data System version 2.0.0 or 2.0.1 and want to upgrade to 2.0.2.0 or greater, you must reconfigure the OCS storage nodes. The chosen layout of the Red Hat OpenShift Container Storage (OCS) 4.X cluster changed starting with Cloud Pak for Data System 2.0.1.1. Before that, OCS was configured to source the NVMe drives from all worker nodes into the OCS cluster in order to provide maximum space utilization. However, sourcing drives to OCS costs vCPUs, which reduces the amount of vCPUs available for Cloud Pak for Data (CPD). Also, sourcing drives from more nodes increases the length of time that is required for Machineconfig updates, which are required by Red Hat OpenShift (OCP) for many operations. This is why, from 2.0.1.1 onward, the cluster is kept to a minimal size by default to reserve the vCPU resources for the CPD applications themselves. Scripts are in place that allow you to scale up the OCS cluster if absolutely needed, but you must first consult with technical sales in accordance with SS4CP licensing terms.
Run the following steps after the platform upgrade but before the OCP upgrade; they must be run at that time. They help you determine whether you are ready to scale OCS down as required, and then how to perform that scale-down operation.
  1. Determine whether you are using more than the amount of storage provided by a 3-node OCS OSD cluster.
    Note: The amount of available space on the 3-node OCS OSD cluster is 15.36 TB. However, OCS is set up with 3-way replication, so the raw space capacity is 46.2 TB, which is equal to 42 TiB. You must scale the cluster down if you are not using more than that amount. OCS reports space usage in raw capacity, not usable capacity, and in TiB or GiB, not TB or GB.
  2. Determine whether the OCS cluster is ready to be scaled down. ssh to e1n1 and run:
    oc -n openshift-storage rsh `oc get pods -n openshift-storage | grep ceph-tool | cut -d ' ' -f1` ceph status | grep usage
    For example:
    [root@e1n1 ~]# oc -n openshift-storage rsh `oc get pods -n openshift-storage | grep ceph-tool | cut -d ' ' -f1` ceph status | grep usage
        usage:   1003 GiB used, 84 TiB / 85 TiB avail
    If you are using more than 42 TiB, ignore the rest of these steps and proceed with the upgrade.
  3. Scale down the OCS cluster.
    Note: This step might take some time and varies by cluster size. To calculate a rough estimate assuming that nothing fails and stops the automated scale-down early, ssh to e1n1 and run:
    nodes=(`oc get nodes | grep fbond | awk '{print $1}'`) ; echo "$(($((${#nodes[@]} - 6)) * 10)) minutes"
    For example:
    [root@e1n1 ~]# nodes=(`oc get nodes | grep fbond | awk '{print $1}'`) ; echo "$(($((${#nodes[@]} - 6)) * 10)) minutes"
    20 minutes
    You can run the scale-down in a screen session to prevent closing the terminal window and prematurely stopping the process.
  4. ssh to e1n1 and run:
    nodes=(`oc get nodes | grep fbond | awk '{print $1}'`) ; for node in ${nodes[@]:6}; do /opt/ibm/appliance/platform/xcat/scripts/storage/ocs/remove_node.py --node $node; done
    This scale-down step might fail. If it fails, you get an error message on the command line where you ran the script. Rerun it and check whether it progresses past the exact point where it failed before. Keep rerunning it until it completes without any error messages, or until progress is unchanged on consecutive runs.
  5. Tail the log for this operation in a separate terminal window for monitoring, future reference, or troubleshooting. ssh to e1n1 and run:
    tail -f /var/log/appliance/platform/xcat/remove_ocs_node.log.tracelog
    You can expect a similar output, for example:
    2022-10-19 10:05:55.221188 TRACE: 
    2022-10-19 10:05:56.221589 TRACE: Running command: oc rsh -n openshift-storage rook-ceph-tools-5d95856894-6mjlw ceph status
    2022-10-19 10:05:56.965374 TRACE: Gathered ceph status: HEALTH_OK
    2022-10-19 10:05:56.965871 TRACE: Running command: oc rsh -n openshift-storage rook-ceph-tools-5d95856894-6mjlw ceph health detail
    2022-10-19 10:05:57.680450 TRACE: Running command: oc label node e2n4.fbond cluster.ocs.openshift.io/openshift-storage-
    2022-10-19 10:05:57.850821 TRACE: Running command: oc label node e2n4.fbond topology.rook.io/rack-
    2022-10-19 10:05:58.026213 TRACE: Running command: oc get nodes --show-labels
    2022-10-19 10:05:58.188044 TRACE: Running command: oc get -n openshift-storage pods -l app=rook-ceph-operator
    2022-10-19 10:05:58.356336 TRACE: Running command: oc get -n openshift-storage pods -l app=rook-ceph-operator
    2022-10-19 10:05:58.518562 TRACE: Running command: oc rsh -n openshift-storage rook-ceph-tools-5d95856894-6mjlw ceph osd df
    2022-10-19 10:06:04.229206 TRACE: Running command: oc get pdb -n openshift-storage
    2022-10-19 10:06:04.378265 TRACE: ********** Node e2n4.fbond removed from OCS! **********
Automatic pod failover might not work in case of a node failure and pods might get stuck in Terminating state
The automatic pod failover might not work in the case of an abrupt node failure. OpenShift notices that the node is not ready and starts evicting the pods, but they get stuck in Terminating state.
[root@gt14-node1 ~]# oc get pod -n nps-1
NAME                       READY   STATUS        RESTARTS   AGE
bnr-san-storage-server-0   1/1     Terminating   0          3h12m
ipshost-0                  1/1     Terminating   0          13d
nps-1-init-host-7jfx4      1/1     Running       0          13d
nps-1-init-host-c4gt4      1/1     Running       0          13d

Workaround:

Run the oc delete command:
oc delete pod/<pod-name> --force --grace-period=0 -n <your-ns>

OR

Do a manual etcd database update by running:
$ oc project openshift-etcd
$ oc rsh -c etcdctl <etcd-pod-name>
sh-4.4# etcdctl del /kubernetes.io/pods/<namespace>/<pod-name>
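To find the <etcd-pod-name> for the oc rsh command, a minimal sketch, assuming the etcd pods in the openshift-etcd namespace carry the standard app=etcd label:
    # List the etcd pods; use one of the returned names in the oc rsh command above
    oc get pods -n openshift-etcd -l app=etcd -o name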
The 2.0.2.1 upgrade fails during the HpiPostinstaller.operatorpostinstall phase due to unstable pods
The 2.0.2.1 upgrade fails during the HpiPostinstaller.operatorpostinstall phase due to unstable pods, with the following error:
1. HpiPostinstaller.operatorpostinstall
        Upgrade Detail: Component post install for hpi
        Caller Info:The call was made from 'HpiPostinstaller.do_operator_postinstall' on line 341 with file located at '/localrepo/2.0.2.1_release/EXTRACT/platform/upgrade/bundle_upgraders/../h
        Message: AbstractUpgrader.postinstaller:Encountered error in postinstall of HPI operator
{}
Workaround:
  1. Run the following command:
    oc get pods -n ap-hpid -o wide | grep ap-hpid
  2. Monitor the pods by rerunning the oc command from step 1 until they are healthy (see the polling sketch at the end of this section).
    Example of unhealthy pods:
    ap-hpid-daemonset-49ct4   0/1     Terminating         0          2d19h   9.0.32.21   e2n2.fbond   <none>           <none>
    ap-hpid-daemonset-9lv5z   0/1     Terminating         0          2d19h   9.0.32.20   e2n1.fbond   <none>           <none>
    ap-hpid-daemonset-bbvl5   0/1     ContainerCreating   0          7s      9.0.32.23   e2n4.fbond   <none>           <none>
    ap-hpid-daemonset-bp9lv   0/1     Pending             0          7s      <none>      <none>       e2n1.fbond       <none>
    ap-hpid-daemonset-pkqcn   0/1     Pending             0          7s      <none>      <none>       e2n2.fbond       <none>
    ap-hpid-daemonset-rhsvw   0/1     Pending             0          7s      <none>      <none>       e1n4.fbond       <none>
    ap-hpid-daemonset-xhgv6   0/1     ContainerCreating   0          7s      9.0.32.22   e2n3.fbond   <none>           <none>
    ap-hpid-daemonset-z28lx   1/1     Terminating         0          2d19h   9.0.32.19   e1n4.fbond   <none>           <none>
  3. Restart the upgrade if every pod is operational and healthy.
    Example of healthy pods:
    ap-hpid-daemonset-5fbpf   1/1     Running   1          38h   9.0.32.22   e2n3.fbond   <none>           <none>
    ap-hpid-daemonset-8c2gw   1/1     Running   1          38h   9.0.32.23   e2n4.fbond   <none>           <none>
    ap-hpid-daemonset-db6h4   1/1     Running   0          38h   9.0.32.20   e2n1.fbond   <none>           <none>
    ap-hpid-daemonset-h7qmc   1/1     Running   1          38h   9.0.32.21   e2n2.fbond   <none>           <none>
    ap-hpid-daemonset-l4qlg   1/1     Running   2          38h   9.0.32.37   e5n1.fbond   <none>           <none>
    ap-hpid-daemonset-px7fq   1/1     Running   1          38h   9.0.32.28   e4n1.fbond   <none>           <none>
    ap-hpid-daemonset-qmmpc   1/1     Running   1          38h   9.0.32.19   e1n4.fbond   <none>           <none>
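For step 2, the following sketch polls the pods periodically; the 30-second interval is arbitrary.
    # Poll the HPI daemonset pods until all of them show 1/1 Running
    watch -n 30 'oc get pods -n ap-hpid -o wide | grep ap-hpid'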