2.0.2.1 Interim Fix 1 release notes
2.0.2.1 Interim Fix 1 includes a connector node firmware upgrade and a new Cyclops package to support the OCP/OCS 4.12 upgrade.
- What's new
  - RHEL is updated to version 8.8.
  - Cyclops package to support OCP/OCS 4.12. For more information, see Upgrading Cloud Pak for Data System.
  - With Cloud Pak for Data System 2.0.2.1 IF1, the boot mode of the Lenovo connector nodes (SR650) is changed from Legacy to UEFI.
  - Connector nodes receive a new firmware upgrade:
    - XCC
    - UEFI
- Fixed issues
  - Fixed fabric switch memory leaks that caused the switches to run out of resources.
- Known issues
  - Installing OCP fails during the 2.0.2.1 IF1 upgrade due to an MCP issue.

    This problem occurs when there is a timeout waiting for all the nodes to be in Ready state. It can potentially occur during any MCP update. You can see a FATAL_ERROR in apupgrade.log due to an MCP timeout. You can also see that the nodes are in NotReady,SchedulingDisabled state after running oc get nodes, for example:

    ```
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS                        ROLES    AGE   VERSION
    e1n1-master.fbond   Ready                         master   38h   v1.19.0+d670f74
    e1n2-master.fbond   Ready                         master   38h   v1.19.0+d670f74
    e1n3-master.fbond   Ready                         master   38h   v1.19.0+d670f74
    e1n4.fbond          Ready                         worker   37h   v1.19.0+d670f74
    e2n1.fbond          NotReady,SchedulingDisabled   worker   37h   v1.19.0+d670f74
    e2n2.fbond          Ready                         worker   37h   v1.19.0+d670f74
    e2n3.fbond          Ready                         worker   37h   v1.19.0+d670f74
    e2n4.fbond          Ready                         worker   37h   v1.19.0+d670f74
    ```

    Also, crio.service is inactive, for example:

    ```
    [root@gt01-node1 ~]# ssh core@e2n1
    [core@e2n1 ~]$ sudo su
    [root@e2n1 core]# systemctl status crio
    ● crio.service - Open Container Initiative Daemon
       Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
      Drop-In: /etc/systemd/system/crio.service.d
               └─10-mco-default-env.conf, 20-nodenet.conf
       Active: inactive (dead)
         Docs: https://github.com/cri-o/cri-o
    [root@e2n1 core]#
    ```

    At this point, the node is in NotReady state, but ping to the node still works.

    - Workaround
      1. Run the following commands for the node in the NotReady state:

         ```
         ssh core@<node>.fbond
         sudo systemctl stop crio-wipe
         sudo systemctl stop crio
         sudo systemctl stop kubelet
         sudo chattr -iR /var/lib/containers/*
         sudo rm -rf /var/lib/containers/*
         sudo systemctl start crio-wipe
         sudo systemctl start crio
         sudo systemctl start kubelet
         exit
         ```
      2. Restart the upgrade.
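The recovery steps above are a fixed command sequence run on one node. As a minimal sketch, the helper below collects them into a dry-run function that prints each command it would send over ssh instead of executing it, so the sequence can be reviewed first; the function name and the example node are illustrative, not part of the product.

```shell
# Dry-run sketch (assumption: not a product tool). Prints the ssh commands
# from the workaround instead of running them.
recover_node() {
  node="$1"
  for cmd in \
      'systemctl stop crio-wipe' \
      'systemctl stop crio' \
      'systemctl stop kubelet' \
      'chattr -iR /var/lib/containers/*' \
      'rm -rf /var/lib/containers/*' \
      'systemctl start crio-wipe' \
      'systemctl start crio' \
      'systemctl start kubelet'; do
    # To run for real, replace "echo" with: ssh "core@${node}.fbond" "sudo $cmd"
    echo "ssh core@${node}.fbond sudo $cmd"
  done
}

recover_node e2n1
```

Reviewing the printed list before swapping `echo` for a real `ssh` call keeps the destructive `rm -rf /var/lib/containers/*` step visible and deliberate.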
  - The "Unhealthy component detected" warning appears after upgrading to 2.0.2.1 IF1 from version 2.0.2.1.

    After upgrading to 2.0.2.1 IF1 from Cloud Pak for Data System 2.0.2.1, the ap issues command might return the following warnings:

    ```
    [root@csr-node1 ~]# ap issues
    Open alerts (issues) and unacknowledged events
    +------+---------------------+--------------------+-------------------------------------------+-------------------------------+----------+--------------+
    | ID   | Date (EDT)          | Type               | Reason Code and Title                     | Target                        | Severity | Acknowledged |
    +------+---------------------+--------------------+-------------------------------------------+-------------------------------+----------+--------------+
    | 1297 | 2023-10-20 15:17:39 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure7.node1.bmc1    | WARNING  | N/A          |
    | 1298 | 2023-10-20 15:17:39 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure8.node1.bmc1    | WARNING  | N/A          |
    | 1299 | 2023-10-20 15:19:36 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node3.bmc1    | WARNING  | N/A          |
    | 1300 | 2023-10-20 15:19:36 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node2.bmc1    | WARNING  | N/A          |
    | 1301 | 2023-10-20 15:19:50 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node3.bmc1    | WARNING  | N/A          |
    | 1302 | 2023-10-20 15:19:50 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node4.bmc1    | WARNING  | N/A          |
    | 1303 | 2023-10-20 15:19:50 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node4.bmc1    | WARNING  | N/A          |
    | 1304 | 2023-10-20 15:20:05 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node2.bmc1    | WARNING  | N/A          |
    | 1305 | 2023-10-20 15:20:05 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node3.bmc1    | WARNING  | N/A          |
    | 1306 | 2023-10-20 15:20:05 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure6.node1.bmc1    | WARNING  | N/A          |
    | 1307 | 2023-10-20 15:20:20 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node4.bmc1    | WARNING  | N/A          |
    | 1308 | 2023-10-20 15:20:20 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node4.bmc1    | WARNING  | N/A          |
    | 1309 | 2023-10-20 15:20:20 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node2.bmc1    | WARNING  | N/A          |
    | 1310 | 2023-10-20 15:20:34 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure2.node1.bmc1    | WARNING  | N/A          |
    | 1311 | 2023-10-20 15:21:11 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node1.bmc1    | WARNING  | N/A          |
    | 1312 | 2023-10-20 15:21:25 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure5.node1.bmc1    | WARNING  | N/A          |
    | 1313 | 2023-10-20 15:32:22 | SW_NEEDS_ATTENTION | 495: Shared NPS config pool is not paused | sw://openshift.mcp/nps-shared | MINOR    | N/A          |
    | 1318 | 2023-10-21 15:03:36 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node3.bmc1    | WARNING  | N/A          |
    | 1319 | 2023-10-22 19:31:44 | HW_NEEDS_ATTENTION | 201: Unhealthy component detected         | hw://enclosure1.node2.bmc1    | WARNING  | N/A          |
    +------+---------------------+--------------------+-------------------------------------------+-------------------------------+----------+--------------+
    ```
    - Workaround
      Run the following commands after the upgrade:

      ```
      oc get pods -n ap-magneto -o custom-columns=POD:.metadata.name,POD:.metadata.namespace --no-headers | awk '{print $1 " --namespace=" $2}' | xargs oc delete pod
      oc get pods -n ap-hardware-monitoring -o custom-columns=POD:.metadata.name,POD:.metadata.namespace --no-headers | awk '{print $1 " --namespace=" $2}' | xargs oc delete pod
      ```
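In the workaround commands above, the awk stage turns each "pod namespace" pair into arguments for `oc delete pod`. A paper run of just that text transformation, using made-up pod names instead of a live cluster:

```shell
# Sample of what `oc get pods -o custom-columns=... --no-headers` might
# print: one pod name and its namespace per line (names are illustrative).
sample='magneto-operator-abc ap-magneto
magneto-api-def ap-magneto'

# Rewrites "POD NAMESPACE" into "POD --namespace=NAMESPACE"; in the real
# pipeline, xargs then hands each line to `oc delete pod`.
echo "$sample" | awk '{print $1 " --namespace=" $2}'
# Each output line has the form: <pod> --namespace=<namespace>
```

Deleting the pods this way simply forces the magneto and hardware-monitoring pods to be recreated, which clears the stale alerts.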
  - Mellanox version is displayed as 5.8 in <major>.<minor> version format.

    ap version -s and ap version -d display the version of Mellanox as 5.8, that is, in <major>.<minor> version format.
  - Upgrading to 2.0.2.1 IF1 might fail due to node e2n2 not being ready.

    The upgrade to 2.0.2.1 IF1 might fail with an error that states e2n2.fbond openshift node(s) found not Ready:
    ```
    Performing operator upgrade pre-checks for component magneto
    2023-11-02 10:59:08 INFO: magneto:Starting Magneto operator pre-install checks...
    2023-11-02 10:59:08 INFO: Checking if pods are healthy in ap-magneto namespace
    2023-11-02 10:59:08 INFO: All ap-magneto pods healthy
    2023-11-02 10:59:08 INFO: magneto:Done executing Magneto operator pre-install checks.
    2023-11-02 10:59:08 ATTENTION: Attempting self-healing fix for operation "check_oc_nodes_active"
    2023-11-02 10:59:08 ERROR: Error running command [sudo rm -rf /var/lib/containers/*] on node core@['e2n2.fbond'].
    2023-11-02 10:59:08 ERROR: sss_ssh_knownhostsproxy: Could not resolve hostname [e2n2.fbond]
    2023-11-02 10:59:08 ERROR: kex_exchange_identification: Connection closed by remote host
    2023-11-02 10:59:08 ERROR: ERROR: Error running command [sudo rm -rf /var/lib/containers/*] on node core@['e2n2.fbond'].
    2023-11-02 10:59:08 ERROR: sss_ssh_knownhostsproxy: Could not resolve hostname [e2n2.fbond]
    2023-11-02 10:59:08 ERROR: kex_exchange_identification: Connection closed by remote host
    2023-11-02 10:59:08 ERROR:
    2023-11-02 10:59:08 FATAL ERROR: Upgrade prerequisites not met. The system is not ready to attempt an upgrade.
    2023-11-02 10:59:08 FATAL ERROR: check_oc_nodes_active : e2n2.fbond openshift node(s) found not Ready
    2023-11-02 10:59:08 FATAL ERROR:
    2023-11-02 10:59:08 FATAL ERROR: More Info: See trace messages at /var/log/appliance/apupgrade/20231102/apupgrade20231102081610.log.tracelog for additional troubleshooting information.
    2023-11-02 10:59:08 logfile is /var/log/appliance/apupgrade/20231102/apupgrade20231102081610.log
    2023-11-02 10:59:08
    ========================================================
    ATTENTION : Errors encountered during upgrade operation
    ========================================================
    Logfile: /var/log/appliance/apupgrade/20231102/apupgrade20231102081610.log
    Upgrade Command: apupgrade --upgrade --use-version 2.0.2.1.IF1_release --upgrade-directory /localrepo --phase platform
    1. check_oc_nodes_active
       Upgrade Detail: Pre-install Checks
       Caller Info: The call was made from 'YosemiteBundleUpgradeChecker.check_oc_nodes_active' on line 372 with file located at '/localrepo/2.0.2.1.IF1_release/EXTRACT/platform/upgrade/modules/ibm/ca/checker/yosemite_bundleupgradechecker.py'
       Message: e2n2.fbond openshift node(s) found not Ready
    ========================================================
    ```
    - Workaround
      1. Identify the node or nodes that are disabled and caused the failure.
      2. Enable those nodes.
      3. Confirm that the nodes are online, and then restart the upgrade.
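One quick way to identify the problem nodes is to filter the node list by status. A minimal sketch of that check, run here against sample `oc get nodes --no-headers` output rather than a live cluster (node names are examples from this document):

```shell
# Sample `oc get nodes --no-headers` output; on a live system you would
# pipe the real command into the awk filter below.
sample='e1n1-master.fbond Ready master 38h v1.19.0+d670f74
e2n1.fbond NotReady,SchedulingDisabled worker 37h v1.19.0+d670f74'

# Print the name and status of every node whose STATUS column is not
# exactly "Ready" (catches NotReady and SchedulingDisabled variants).
echo "$sample" | awk '$2 != "Ready" {print $1, $2}'
```

An empty result from the filter is a reasonable signal that all nodes are Ready and the upgrade can be restarted.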
  - ap version -s might not display the upgrade information for CPDS version 2.0.2.1 IF1, and the upgrade fails on the component component_version_cronjob.py with the following error in the tracelog:

    ```
    [root@e1n1-QVR component_version_cronjob]# /usr/bin/python3 /opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py
    Traceback (most recent call last):
      File "/opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py", line 277, in <module>
        software_version = versions["appliancesoftwareversion"][first_node].split("-", 1)[0]
    KeyError: 'e10n1'
    ```
    - Workaround
      1. Run the following command to hard-code the first node as e1n1 in the script:

         ```
         sed -i -e 's|first_node = get_nodes()\[0\]|first_node = "e1n1"|g' /opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py
         ```
      2. Rerun the component_version_cronjob.py script with the following command:

         ```
         python3 /opt/ibm/appliance/apupgrade/modules/ibm/ca/component_version_cronjob/component_version_cronjob.py
         ```
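Because the sed command above edits a product script in place, it can be worth rehearsing the substitution on a scratch file first. A minimal sketch (the temp file stands in for component_version_cronjob.py):

```shell
# Rehearse the workaround's sed substitution on a scratch copy instead of
# the real script, so the result can be inspected before committing.
tmp=$(mktemp)
echo 'first_node = get_nodes()[0]' > "$tmp"

# Same expression as the workaround: the brackets are escaped so sed
# treats [0] literally rather than as a character class.
sed -i -e 's|first_node = get_nodes()\[0\]|first_node = "e1n1"|g' "$tmp"

cat "$tmp"   # prints: first_node = "e1n1"
rm -f "$tmp"
```

If the scratch run prints the expected line, the same command can be applied to the real script path from the workaround.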
Before you begin
- Download the package 2.0.2.1.IF1-WS-ICPDS-fpXXX, where XXX stands for the latest package number, from Fix Central.
- The system must be on version 2.0.2.1 before you apply the interim fix.
- If NPS® has non-default admin account credentials, complete the following actions before you can upgrade:
  - Ensure that you have the NPS database admin user password.
  - In the /home/nz/.bashrc file inside the container, set NZ_USER=admin and NZ_PASSWORD=<customer_password>.
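The .bashrc change above can be scripted. The sketch below writes the two variables to a scratch file; whether the real file uses plain assignments or `export`, and the scratch path itself, are assumptions here, and the password placeholder is deliberately left for the customer value:

```shell
# Scratch stand-in for /home/nz/.bashrc inside the container (assumption:
# the real edit targets that file). Do not commit a real password to docs.
rcfile=$(mktemp)
NZ_PASSWORD_VALUE='<customer_password>'   # placeholder; substitute the real password

# Append the two settings the upgrade expects (export form is an assumption).
{
  echo 'export NZ_USER=admin'
  echo "export NZ_PASSWORD=${NZ_PASSWORD_VALUE}"
} >> "$rcfile"

grep '^export NZ_' "$rcfile"
rm -f "$rcfile"
```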
| Dell/Lenovo | Upgrade time |
|---|---|
| Lenovo Base + 8 | 7 hours (NPS, firmware downgrade) |
| Lenovo Base | 5 hours (firmware downgrade) |