Version 2.0.2.1 release notes
Cloud Pak for Data System version 2.0.2.1 improves the upgrade process.
Upgrading
If Cloud Pak for Data tenants are installed (for example, in the zen or ap-console namespace), those tenants are upgraded automatically during the service bundle upgrade, unless you pin the installation to a specific version in the ZenService custom resource. For more information, see the Manual upgrade section in Choosing an upgrade plan for the Cloud Pak for Data control plane.
The end-to-end upgrade time from 2.0.1.1 to 2.0.2.1 ranges from 24 to 30 hours. This includes the system, firmware, and OCP/OCS component upgrades.
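If you want to keep a tenant at its current release while the service bundle is upgraded, the version can be pinned in the ZenService custom resource. A minimal sketch, assuming the default lite-cr custom resource name used elsewhere in these notes and the zen namespace; adjust the namespace and version for your tenant:
# Pin the Cloud Pak for Data control plane at the 4.0.2 level (ZenService version 4.2.0)
# Namespace "zen" is an assumption; the notes below show the same patch for ap-console
oc patch zenservice lite-cr \
  --namespace zen \
  --type=merge \
  --patch '{"spec": {"version": "4.2.0"}}'
# Confirm the pinned version
oc get zenservice lite-cr -n zen -o jsonpath='{.spec.version}'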
Software components
- Red Hat OpenShift Container Platform 4.8.37
- Red Hat OpenShift Data Foundation 4.8.7
- Cloud Pak for Data 4.0.2
  For information about new features, see What's new in IBM Cloud Pak for Data.
- Netezza Performance Server 11.2.2.3
  For information about new features, see Netezza Performance Server 11.2.2.3 release notes.
Fixed issues
- Fixed the issue of OCS reporting unhealthy due to mon low on available space during the upgrade from 2.0.1.0 to 2.0.2.0.
- Fixed the issue of OCS reporting unhealthy due to mon with slow ops during the upgrade from 2.0.1.0 to 2.0.2.0.
Known issues
- Nodes reboot at the preliminary-checks step during the 2.0.2.1 upgrade
- check_ipmitool_lan is one of the preliminary checks during the 2.0.2.1 upgrade. When this check fails, you can see the following messages in the upgrade log:
LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:81 2022-02-28 14:21:48 ERROR: Error running command [ifup fbond] on [u'e1n1'] LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:81 2022-02-28 14:21:48 ERROR: nodeos:NodeosUpgrader.prechecker.fixer:Unable to fix problem with ipmitool lan. Fatal Problem: Failed to start GPFS. LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:85 2022-02-28 14:21:48 TRACE: In method logger.py:log_error:142 from parent method node_os_prechecker_fixer.py:fix_check_ipmitool_lan:85 with args msg = nodeos:NodeosUpgrader.prechecker.fixer:Unable to fix problem with ipmitool lan. Fatal Problem: Failed to start GPFS. LOGGING FROM: node_os_prechecker_fixer.py:fix_check_ipmitool_lan:85 2022-02-28 14:21:48 Upgrade prerequisites not met. The system is not ready to attempt an upgrade. check_ipmitool_lan : nodeos:NodeosUpgrader.prechecker:Problem encountered during check for problem with lan interface. A reboot of the following nodes is suggested [u'e1n1', u'e2n1', u'e3n1'] LOGGING FROM: bundle_upgrade.py:report_any_failed_operation_results:1124
Workaround:
- Check the uptime of all the nodes that require a reboot, as shown in the sketch after these steps. If the uptime is less than five minutes, the node was already rebooted automatically.
- If the uptime is greater than five minutes, reboot the node manually.
- After the required nodes reboot, restart the upgrade. For example:
apupgrade --preliminary-check-with-fixes --use-version 2.0.2.1 --upgrade-directory /localrepo --phase platform
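The uptime check in the first workaround step can be run from e1n1 in one pass. A minimal sketch, assuming the node names reported in the log above (e1n1, e2n1, e3n1) and passwordless SSH between nodes:
# Print the uptime of every node flagged by check_ipmitool_lan
for node in e1n1 e2n1 e3n1; do
  echo "=== $node ==="
  ssh $node uptime
done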
- MCP update fails during 2.0.2.1 upgrade
- During the --phase platform upgrade, the MCP update fails with the following message:
INFO: Function: check_update_completion failed after 90 tries : Nodes in worker pool are updating INFO: AbstractUpgrader.postinstaller:MCP update failed, attempting self-recovery ERROR: AbstractUpgrader.postinstaller:Unable to perform self-recovery, MCP update failed: AbstractUpgrader.postinstaller:Unable to perform self-recovery, the cause of error is unknown INFO: mcp:Done executing MCP post-install steps. FATAL ERROR: Errors encountered FATAL ERROR: FATAL ERROR: McpUpgrader.postinstall : ERROR: AbstractUpgrader.postinstaller:Unable to perform self-recovery, MCP update failed: AbstractUpgrader.postinstaller:Unable to perform self-recovery, the cause of error is unknown
It can also fail after --phase platform completes, with the following message:
"'oc describe node | grep state' would mention unable to drain the pod zen-metastoredb-xx"
The problem is with the zen-metastoredb pod eviction.
Workaround:
- Run:
oc delete pdb zen-metastoredb-budget -n zen
- Run the following command so that the node can drain the pod and continue with the MCP update:
oc delete pdb zen-metastoredb-budget -n ap-console
- Run the following command to determine whether the upgrade is still running after both oc delete pdb commands:
ps fx | grep apupgrade
If the upgrade is still running, it continues on its own. If it stopped running, restart it. A verification sketch follows these steps.
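Before checking the apupgrade process, you can confirm that both pod disruption budgets are gone; the zen-metastoredb-budget name and the namespaces are taken from the workaround commands above:
# Both commands should report that the resource is not found after deletion
oc get pdb zen-metastoredb-budget -n zen
oc get pdb zen-metastoredb-budget -n ap-console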
- 2.0.2.1 platform upgrade fails after GPFS reboot because a node is unreachable
- When you run into this issue, you get output on stopping GPFS and checking whether the nodes are available. After the check, you can see that one of the nodes is not reachable, along with the following error:
INFO: Some nodes were not available. ['eXnY] ERROR: Unable to powercycle nodes via ipmitool. ERROR: 'bmc_addr' ERROR: The following nodes are still unavailable after a reboot attempt: ['eXnY'] FATAL ERROR: Problem rebooting nodes
Then, the upgrade stops.
Workaround:
- Reboot the unreachable node.
- Mount the file systems. Run:
mmmount all -a
- Verify the mounts. Run:
mmlsmount all -L
For example:
[root@gt03-node1 ~]# mmlsmount all -L File system ips is mounted on 3 nodes: 9.0.32.16 e1n1 9.0.32.17 e1n2 9.0.32.18 e1n3 File system platform is mounted on 3 nodes: 9.0.32.18 e1n3 9.0.32.16 e1n1 9.0.32.17 e1n2
- Restart the upgrade.
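Before restarting the upgrade, it can also help to confirm that GPFS is active on every node. A minimal check, assuming the GPFS commands are on the PATH as in the mmmount step above:
# Every node should report "active" before the upgrade is restarted
mmgetstate -a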
- 2.0.2.1 upgrade fails during platform phase due to unstable MCP state
- This failure happens during --phase platform due to the following error:
FATAL ERROR: MCP is not in a stable state. Please resolve this issue before attempting an upgrade.
You can find more information in the corresponding log, for example: /var/log/appliance/apupgrade/20220630/apupgrade20220630082336.log
Workaround:
- Check the nodes and look for those that are NotReady by running:
oc get nodes
For example:
[root@gt01-node1 ~]# oc get nodes NAME STATUS ROLES AGE VERSION e1n1-master.fbond Ready master 38h v1.19.0+d670f74 e1n2-master.fbond Ready master 38h v1.19.0+d670f74 e1n3-master.fbond NotReady master 38h v1.19.0+d670f74 e1n4.fbond Ready worker 38h v1.19.0+d670f74 e2n1.fbond Ready worker 38h v1.19.0+d670f74 e2n2.fbond Ready worker 38h v1.19.0+d670f74 e2n3.fbond NotReady worker 38h v1.19.0+d670f74 e2n4.fbond Ready worker 38h v1.19.0+d670f74
- Check the MCP status by running:
oc get mcp
For example:
[root@gt01-node1 ~]# oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-f17afa51f05e6aed60f2eb1931db6c63 False True False 3 2 3 0 38h unset rendered-unset-3e6f7f0f84f4a0796786b102c7679233 False True False 2 1 2 0 38h worker rendered-worker-3e6f7f0f84f4a0796786b102c7679233 True False False 3 3 3 0 38h
- Verify the CSR certificates for the nodes that show True in the DEGRADED or UPDATING columns.
- Check whether there are any pending certificates by running:
oc get csr
- Approve all certificates by running:
oc get csr -o name | xargs oc adm certificate approve
- Run oc get nodes and oc get mcp to confirm that all nodes are Ready and all MCPs are UPDATED. For example:
[root@gt01-node1 ~]# oc get nodes NAME STATUS ROLES AGE VERSION e1n1-master.fbond Ready master 38h v1.19.0+d670f74 e1n2-master.fbond Ready master 38h v1.19.0+d670f74 e1n3-master.fbond Ready master 38h v1.19.0+d670f74 e1n4.fbond Ready worker 38h v1.19.0+d670f74 e2n1.fbond Ready worker 38h v1.19.0+d670f74 e2n2.fbond Ready worker 38h v1.19.0+d670f74 e2n3.fbond Ready worker 38h v1.19.0+d670f74 e2n4.fbond Ready worker 38h v1.19.0+d670f74
[root@gt01-node1 ~]# oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-f17afa51f05e6aed60f2eb1931db6c63 True False False 3 3 3 0 38h unset rendered-unset-3e6f7f0f84f4a0796786b102c7679233 True False False 3 3 3 0 38h worker rendered-worker-3e6f7f0f84f4a0796786b102c7679233 True False False 3 3 3 0 38h
- Restart the upgrade.
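If you prefer to approve only the certificates that are actually pending rather than every CSR, a variant of the approval command in the workaround (a sketch, not the documented procedure) is:
# Approve only CSRs that are still in Pending state
oc get csr | awk '/Pending/ {print $1}' | xargs -r oc adm certificate approve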
- 2.0.2.1 upgrade fails during the OpenShift phase
- This failure happens after apupgrade starts the system, brings it to the Ready state, and then runs the oc login command. apupgrade terminates with a timeout, even though oc login works when run manually, and after you restart apupgrade the same oc login command succeeds. If you ran into this issue, you can see the following errors in apupgrade.log:
- ERROR: STDERR: [error: Missing or incomplete configuration info. Please point to an existing, complete config file:
- FATAL ERROR: McpUpgrader.preinstall : Failed to login to openshift server : error: The server uses a certificate signed by unknown authority. You may need to use the --certificate-authority flag to provide the path to a certificate file for the certificate authority, or --insecure-skip-tls-verify to bypass the certificate check and use insecure connections.
- FATAL ERROR: McpUpgrader.preinstall : Openshift login timed out after waiting for 30 minutes. Please run following command manually and if successful resume upgrade, or contact IBM Support for help
Workaround:
- Wait for apupgrade to time out.
- Run the oc login command that is written in the log. For example:
oc login --token=$(cat /root/.sa/token) https://api.localcluster.fbond:6443
- Restart the upgrade.
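To confirm that the manual login succeeded before you restart the upgrade, a quick check is:
# Should print the logged-in user and the API server URL
oc whoami
oc whoami --show-server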
- Installing OCP fails during 2.0.2.1 upgrade due to MCP issue
- This problem occurs when there is a timeout waiting for all the nodes to be in the Ready state. It can potentially occur during any MCP update. You can see a FATAL_ERROR in apupgrade.log due to an MCP timeout. You can also see that a node is in the NotReady,SchedulingDisabled state after running oc get nodes, for example:
[root@gt01-node1 ~]# oc get nodes NAME STATUS ROLES AGE VERSION e1n1-master.fbond Ready master 38h v1.19.0+d670f74 e1n2-master.fbond Ready master 38h v1.19.0+d670f74 e1n3-master.fbond Ready master 38h v1.19.0+d670f74 e1n4.fbond Ready worker 37h v1.19.0+d670f74 e2n1.fbond NotReady,SchedulingDisabled worker 37h v1.19.0+d670f74 e2n2.fbond Ready worker 37h v1.19.0+d670f74 e2n3.fbond Ready worker 37h v1.19.0+d670f74 e2n4.fbond Ready worker 37h v1.19.0+d670f74
Additionally, crio.service is inactive, for example:
[root@gt01-node1 ~]# ssh core@e2n1 [core@e2n1 ~]$ sudo su [root@e2n1 core]# systemctl status crio ● crio.service - Open Container Initiative Daemon Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled) Drop-In: /etc/systemd/system/crio.service.d └─10-mco-default-env.conf, 20-nodenet.conf Active: inactive (dead) Docs: https://github.com/cri-o/cri-o [root@e2n1 core]#
At this point, the node is in the NotReady state, but ping to the node still works.
Workaround:
- Run the following commands for the node in the NotReady state:
ssh core@<node>.fbond sudo systemctl stop crio-wipe sudo systemctl stop crio sudo systemctl stop kubelet sudo chattr -iR /var/lib/containers/* sudo rm -rf /var/lib/containers/* sudo systemctl start crio-wipe sudo systemctl start crio sudo systemctl start kubelet exit
- Restart the upgrade.
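After the services restart, it can take a few minutes for the node to rejoin the cluster. A quick check, using the same <node> placeholder as the workaround commands:
# On the affected node: both services should report "active"
ssh core@<node>.fbond sudo systemctl is-active crio kubelet
# From e1n1: the node should return to Ready before you restart the upgrade
oc get node <node>.fbond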
- 2.0.2.1 upgrade fails due to OCP upgrade failure
- This problem happens when the OCP component upgrade fails due to image-registry being degraded. You can also see the following output in the tracelog:
LOGGING FROM: ocp_installer.py:run_ocp_upgrade_script:77 2022-07-12 11:10:32 TRACE: RC: 1. STDOUT: [2022-07-12 11:07:31.326473 INFO: Targeted OCP version is 4.7.51. 2022-07-12 11:07:31.326647 INFO: Starting OCP upgrade ... 2022-07-12 11:07:31.326702 INFO: Validating program parameters... 2022-07-12 11:07:31.326921 INFO: Validating all the nodes are in Ready state ... 2022-07-12 11:07:31.490570 INFO: Validating the system state... 2022-07-12 11:10:32.119651 ERROR: Failed to validate the cluster operators. 2022-07-12 11:10:32.119708 ERROR: Failed to verify the system state. 2022-07-12 11:10:32.119764 INFO: OCP upgrade to version 4.7.51 failed!!. ]
To determine whether you ran into the issue, follow these steps:
- Run the following command to confirm that image-registry shows the following string:
oc get co | grep image-registry
image-registry 4.6.32 True False True 42h
- Run the following command to confirm that the output shows the following message:
oc describe co image-registry
Message: ImagePrunerDegraded: Job has reached the specified backoff limit Reason: ImagePrunerJobFailed Status: True Type: Degraded
Workaround:
- Check the managementState by running:
oc get config cluster -o yaml
- If the managementState is Removed, change it to Managed and save the change. Run:
oc edit config cluster
managementState: Managed
- If the issue persists and step 2 did not solve it, run:
oc patch imagepruner.imageregistry/cluster --patch '{"spec":{"suspend":true}}' --type=merge
oc -n openshift-image-registry delete jobs --all
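After the workaround, you can check that the operator recovered; the DEGRADED column should show False before the upgrade is restarted:
# image-registry should no longer report Degraded=True
oc get co image-registry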
- Ceph is in HEALTH_WARN state due to "1 daemons have recently crashed" during the 2.0.2.1 upgrade
- You might run into a warning in Red Hat OpenShift Container Storage (OCS) in the underlying Ceph cluster due to recently crashed daemons. The 2.0.2.1 upgrade has built-in checks for the health of the Ceph cluster; they flag this warning and stop the operation.
Symptoms:
You can see the following error in the log:
2022-08-02 10:20:13 INFO: OcsUpgrader.postinstall: Finished post upgrade checks for OCS upgrade 2022-08-02 10:20:13 INFO: ocs:Done executing OCS post-install checks. 2022-08-02 10:20:13 FATAL ERROR: Errors encountered 2022-08-02 10:20:13 FATAL ERROR: 2022-08-02 10:20:13 FATAL ERROR: check_ocs_state : OcsUpgrader.postinstall:Error encountered during OCS cluster health check! 2022-08-02 10:20:13 FATAL ERROR: This error requires manual intervention to resolve. Please contact IBM Support.
To confirm that you hit the issue, run:
[root@e1n1 ~]# TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name) [root@e1n1 ~]# oc rsh -n openshift-storage $TOOLS_POD ceph status | grep daemon | grep crashed
and see whether the latter command returned anything on stdout. If it did not, you ran into some other issue and must contact IBM Support.
Workaround:
- Run:
[root@e1n1 ~]# NAMESPACE=openshift-storage [root@e1n1 ~]# ROOK_OPERATOR_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}') [root@e1n1 ~]# oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} -- ceph crash archive-all --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
- Run the following command to ensure that there is no output on stdout now:
[root@e1n1 ~]# oc rsh -n openshift-storage $TOOLS_POD ceph status | grep daemon | grep crashed
Note: If you still see output on stdout, wait another 30 seconds and try again. If the stdout output persists, contact IBM Support.
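An overall Ceph health summary can also be checked from the same tools pod; the cluster should report HEALTH_OK before you restart the upgrade. This reuses the TOOLS_POD variable set in the confirmation step above:
# Show the detailed Ceph health status
oc rsh -n openshift-storage $TOOLS_POD ceph health detail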
- aposDnsCheck fails if there is no wildcard, which breaks the 2.0.2.1 upgrade
- apupgrade calls DnsCheck with the cluster down, so the oc get route command does not work. For example:
random_host = ''.join(random.choice(string.digits + string.ascii_letters) for i in range(32)) if query_dns(random_host + "." + house_yml['all']['vars']['app_fqdn']) != 0: [rc, stdout, stderr] = cmd_runner.run_shell_cmd("oc get routes --all-namespaces | egrep -v localcluster") if rc == 0: logger.log_info("[RUNNING] - The oc command listed customer defined service specific entries.") else: logger.log_print("[FAIL] - Could not run oc get route command. Please make sure you have permissions and access to the oc command and try again.") issue_found = 1
DnsCheck passes only if you have a wildcard DNS entry. If you do not have a wildcard, this check fails and breaks the 2.0.2.1 upgrade.
To determine whether you ran into the issue, follow these steps:
- Run the following command and check whether it fails:
oc get route
- Run the following command and check whether the base FQDN is resolvable but wildcards are not, for example:
/opt/ibm/appliance/platform/apos-comms/tools/aposDnsCheck.py
Validating /opt/ibm/appliance/platform/apos-comms/customer_network_config/ansible/Customer.yml Checking hostname [RUNNING] - Trying to query base FQDN [PASS] - The base FQDN was resolvable [RUNNING] - Trying to query a wildcard entry [ERROR] DNS server <x.y.z.a> appears reachable, but is missing DNS record <xyz>. If you need a different suffix, add a DNS search string or specific explicitly [WARN] If you do not specify a wildcard entry, you must specify a DNS entry for each service that is deployed [RUNNING] - Trying to query customer defined service specific entries [ERROR] Could not run oc get route command. Please make sure you have permissions and access to the oc command and try again. [ERROR] Issue(s) issues above must be corrected on upstream DNS
- Apply the following workaround:
- Manually edit the /localrepo/2.0.1.1/EXTRACT/platform/upgrade/aposcomms/aposcomms_prechecker.py file by commenting out line 233:
Results.append(self.run_dns_check_util())
- Edit the /opt/ibm/appliance/apupgrade/bin/apupgrade file by commenting out line 837:
self.verify_bundle(self.working_dir)
- If you run step 2 but get the following output:
[RUNNING] - Trying to query base FQDN [ERROR] Error trying to query base FQDN <customer-fqdn>, please make sure the base FQDN indicated in the house yaml can be resolved
then you do not have proper DNS base FQDN forwarding records. Stop the upgrade and fix your base FQDN records.
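The wildcard test that aposDnsCheck performs can also be reproduced by hand. A minimal sketch, where <app_fqdn> is a placeholder for the application FQDN configured in Customer.yml:
# The lookup succeeds only when a wildcard DNS record exists for the application domain
nslookup randomhost123.<app_fqdn>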
- Storage overfilled warning is seen during the 2.0.2.1 upgrade
- During the 2.0.2.1 upgrade, /localrepo/2.0.2.1_release/ becomes overfilled and you can see the following error:
[root@e1n1 ~]# ap issues Open alerts (issues) and unacknowledged events +------+---------------------+---------------------+------------------------------------------+-------------------------------+----------+--------------+ | ID | Date (UTC) | Type | Reason Code and Title | Target | Severity | Acknowledged | +------+---------------------+---------------------+------------------------------------------+-------------------------------+----------+--------------+ | 1024 | 2022-09-15 22:41:07 | STORAGE_UTILIZATION | 901: Storage utilization above threshold | sw://fs.sda2/enclosure1.node1 | WARNING | N/A | +------+---------------------+---------------------+------------------------------------------+-------------------------------+----------+--------------+
Workaround:
- Run the following command to remove the release directory that you created at the start of the upgrade:
rm -rf /localrepo/2.0.2.1_release/
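To confirm that the space was reclaimed and the STORAGE_UTILIZATION alert can clear, a quick check is:
# The usage of the file system that holds /localrepo should drop after the removal
df -h /localrepo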
- Unexpected upgrade might be triggered if your zen operand version is not pinned down to Cloud Pak for Data 4.2.0 (4.0.2 level) in the ap-console namespace
- To prevent that, existing 2.x Cloud Pak for Data System customers must pin down the zen operand version to 4.2.0 (Cloud Pak for Data 4.0.2 level) in the ap-console namespace. Upgrading Cloud Pak for Data to a higher version should be done only in the zen namespace.
Workaround:
- Run the following command to determine the current version:
oc get zenservice lite-cr -n ap-console -o json | jq .spec.version
- Depending on the oc get version results, perform the following actions:
- If you get a version above 4.2.0, contact IBM Cloud Pak for Data Support.
- If you get 4.2.0 as the current version, no action is required.
- If you get a version below 4.2.0, run:
oc patch zenservice lite-cr --namespace ap-console --type=merge --patch '{"spec": {"version":"4.2.0"}}'
- Enabling FIPS fails after the 2.0.2.1 upgrade completes
- If you upgraded to the 2.0.2.1 version and try to re-enable FIPS, the command fails with the following error:
dracut: installkernel failed in module kernel-modules-extra
This happens because old kernel-modules-extra RPMs remain on your system along with the new kernel RPM after the upgrade. To re-enable FIPS, you must first remove the old RPM from all control nodes after the upgrade. Contact IBM Support for assistance. After this is complete, you can re-enable FIPS. For more information, see Configuring FIPS on 2.0.2 Cloud Pak for Data System.
Workaround:
- Remove the old kernel-modules-extra* RPM corresponding to the old kernel version. For example, run:
# yum remove kernel-modules-extra-4.18.0-193.13.2.el8_2.x86_64
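To identify which kernel-modules-extra packages are installed and which kernel is currently running, so that only the old package is removed, a quick check is:
# List installed kernel-modules-extra packages and the running kernel version
rpm -qa 'kernel-modules-extra*'
uname -r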
Other known issues
- Lenovo SD530 data collection for enclosures fails to collect enclosure Fast Failure Data Collection (FFDC) from the System Management Module (SMM)
- To avoid that issue, use sys_hw_util. Depending on your requirements, you can:
- Collect one SMM FFDC from one enclosure, for example enclosure2, by running:
/opt/ibm/appliance/platform/hpi/sys_hw_util ffdc -t e2n1 --smm
- Collect multiple SMM FFDC from all enclosures in a BASE+2 (four enclosures) system by running:
/opt/ibm/appliance/platform/hpi/sys_hw_util ffdc -t e{1..4}n1 --smm
After you run this command on the console, you get the file location of the collected data. For example:
Comp | FFDC Log Location | ---------+-----------------------------------------------+ e2smm | /var/log/appliance/platform/hpi/FFDC_e2smm/ |
- Some BMC/SMM configuration settings on CPDS can change unexpectedly
- Several Cloud Pak for Data System issues were seen recently whose root cause was related to one or more BMC configuration settings having been changed from their expected values. The symptoms vary, depending on which settings were changed. One case observed recently on several different systems resulted in one or more NPS SPUs not being able to PXE boot. This might result in Netezza Performance Server going into a DOWN state, depending on how many SPUs are affected.
To determine whether any BMC settings were modified from the values that are expected on the Cloud Pak for Data System platform, run:
/opt/ibm/appliance/platform/hpi/sys_hw_check node
If any nodes have BMC settings that are incorrectly configured, the screen output from sys_hw_check indicates that the checks for those nodes failed. The report from sys_hw_check itemizes exactly which settings generated an ERROR.
Note: Some of these settings, when corrected, automatically reboot the corrected nodes for the changes to take effect. The specific settings that cannot be changed without requiring a reboot are:
- DevicesandIOPorts.OnboardLANPort1 and 2
- EnableDisableAdapterOptionROMSupport.OnboardLANPort1 and 2
- MellanoxNetworkAdapter--Slot6 PhysicalPort1 LogicalPort1.NetworkLinkType and PhysicalPort2 (Connector Node only)
- DevicesandIOPorts.Device_Slot2 and 4 (Connector Node only)
Before you run sys_hw_config to correct any of these (or other) BMC settings, check the report log from sys_hw_check. If any of the mentioned four BMC settings are flagged as incorrect, resetting them to the expected values requires the affected nodes to be rebooted. If so, ensure that these node reboots are acceptable before proceeding.
Workaround:
- Run the following command to fix all of the incorrectly configured settings:
/opt/ibm/appliance/platform/hpi/sys_hw_config node
- Rerun sys_hw_check to verify that the settings are correct.
- If NPS went down due to this problem, perform the following actions:
- Run:
nzstop
- Run the following command to bring NPS back online:
nzstart
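To confirm the Netezza Performance Server state after the restart, a quick check (assuming the standard NPS command-line tools used in the nzstop/nzstart steps above) is:
# The system state should return to online after nzstart completes
nzstate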
- chronyd does not support poor-quality Windows NTP servers
- Poor-quality time sources might be automatically rejected by Cloud Pak for Data System. Provide production-quality time sources to avoid that.
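To see which configured time sources chronyd currently accepts or rejects, a quick check on a control node is:
# Sources marked with '*' or '+' are in use; '?' or 'x' indicates unreachable or rejected sources
chronyc sources -v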
- Invalid python symbolic link gets set during the 2.0.2.1 upgrade, which breaks ansible
- This problem happens during the 2.0.2.1 upgrade if the ansible command fails with the following error:
[ERROR] Failed running ansible command: [DEPRECATION WARNING]: Distribution redhat 8.6 on host node1 should use [ERROR] /usr/libexec/platform-python, but is using /usr/bin/python for backward
Workaround:
- Verify the symbolic link on e1n1 by running:
[root@e1n1 ansible]# ls -lah /usr/bin/python lrwxrwxrwx. 1 root root 36 Nov 2 2021 /usr/bin/python -> /etc/alternatives/unversioned-python
You can see that the link is not set on e1n2/e1n3, for example:
[root@e1n1 ansible]# ssh e1n2 ls -lah /usr/bin/python ls: cannot access '/usr/bin/python': No such file or directory
- Remove the invalid symbolic link on e1n1. Run:
[root@e1n1 ansible]# rm /usr/bin/python rm: remove symbolic link '/usr/bin/python'? y
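After the invalid link is removed, the layout should match across the control nodes. A minimal check, assuming the control node names e1n1, e1n2, and e1n3 referenced above:
# Compare the /usr/bin/python link state on all control nodes
for node in e1n1 e1n2 e1n3; do
  echo "=== $node ==="
  ssh $node 'ls -lah /usr/bin/python 2>&1'
done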
- OCP node fbond advertising with wrong MAC address
- This problem happens during the upgrade from 2.0.0 or 2.0.1 to 2.0.2.x. If you see an error similar to the following:
x509: certificate is valid for 9.0.32.46, not 9.0.59.79
it might mean that the MAC address that is being advertised for fbond on the node is incorrect. In the error example, the correct IP is the 9.0.32.X address, but the 9.0.59.Y address is in the dynamic range for devices that are not registered in DHCP (that is, 9.0.48.1-9.0.63.235), which might mean that the node has an improper MAC address and the DHCP server has no record of it. Set the MAC address on the node back to the MAC address listed in fab_1 of the SJC.
Workaround:
- Run the following command and find the node that has the IP address that matches the second IP from the error message, which is 9.0.59.79 (eXnY) in the example:
oc get nodes -o wide
- Find the correct MAC address for the node by looking in the SJC for the node, in the "fab_1" section of the component/subcomponent entry for the eXnY found in step 1. For example:
"fab_1": { "mac_addr": "1c:34:da:48:c4:64", "sw_alias": "", "sw_port": "" },
- Run the following commands:
ssh -ql core@eXnY (by using node name from step 1) MAC=MAC address from step 2 sudo nmcli con mod fbond 802-3-ethernet.mac-address $MAC sudo nmcli con mod fbond 802-3-ethernet.cloned-mac-address $MAC sudo nmcli con down fbond sudo nmcli con up fbond
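To verify that fbond now advertises the MAC address taken from fab_1, a quick check (eXnY is the placeholder node name from the steps above) is:
# The fbond connection and link should both report the MAC address from fab_1
ssh core@eXnY nmcli -g 802-3-ethernet.cloned-mac-address con show fbond
ssh core@eXnY ip link show fbond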
- OCS storage nodes reconfiguring is required when upgrading from version 2.0.0 or 2.0.1.0 to 2.0.2 or greater
- If you are running Cloud Pak for Data System version 2.0.0 or 2.0.1 and want to upgrade to 2.0.2.0 or greater, you must reconfigure the OCS storage nodes. The chosen layout of the Red Hat OpenShift Container Storage (OCS) 4.X cluster changed starting with Cloud Pak for Data System 2.0.1.1. Before that, OCS was configured to source the NVMe drives from all worker nodes to the OCS cluster in order to provide maximum space utilization. However, sourcing drives to OCS costs vCPUs, which reduces the amount of vCPUs available for Cloud Pak for Data (CPD). Also, sourcing drives from more nodes increases the length of time that is required for Machineconfig updates, which are required by Red Hat OpenShift (OCP) for many operations. This is why, from 2.0.1.1 onward, the cluster is kept to a minimal size by default to reserve the vCPU resources for the CPD applications themselves. There are scripts in place that allow you to scale up the OCS cluster if absolutely needed, but before doing so you must consult with technical sales in accordance with SS4CP licensing terms.
Run the following steps after the platform upgrade, but before the OCP upgrade. They must be run at that time. They help you determine whether you are ready to scale OCS down as required, and then how to perform that scale-down operation.
- Determine whether you are using more than the amount of storage provided by a 3-node OCS OSD cluster.
Note: The amount of available space on the 3-node OCS OSD cluster is 15.36 TB. However, OCS is set up with 3-way replication, so the raw space capacity is 46.2 TB, which is equal to 42 TiB. You must scale the cluster down to that amount if you are not using more than that. OCS reports space usage not in usable capacity, but raw capacity, and not in TB or GB, but TiB or GiB.
- Determine whether the OCS cluster is ready to be scaled down. ssh to e1n1 and run:
oc -n openshift-storage rsh `oc get pods -n openshift-storage | grep ceph-tool | cut -d ' ' -f1` ceph status | grep usage
For example:
[root@e1n1 ~]# oc -n openshift-storage rsh `oc get pods -n openshift-storage | grep ceph-tool | cut -d ' ' -f1` ceph status | grep usage usage: 1003 GiB used, 84 TiB / 85 TiB avail
If you are using more than 42 TiB, ignore the rest of these steps and proceed with the upgrade.
- Scale down the OCS cluster.
Note: This step might take some time and varies by cluster size. To calculate a rough estimate, assuming that nothing fails and stops the automated scale-down early, ssh to e1n1 and run:
nodes=(`oc get nodes | grep fbond | awk '{print $1}'`) ; echo "$(($((${#nodes[@]} - 6)) * 10)) minutes"
For example:
[root@e1n1 ~]# nodes=(`oc get nodes | grep fbond | awk '{print $1}'`) ; echo "$(($((${#nodes[@]} - 6)) * 10)) minutes" 20 minutes
You can run the scale-down in a screen session to prevent closing the terminal window and prematurely stopping the process. ssh to e1n1 and run:
nodes=(`oc get nodes | grep fbond | awk '{print $1}'`) ; for node in ${nodes[@]:6}; do /opt/ibm/appliance/platform/xcat/scripts/storage/ocs/remove_node.py --node $node; done
This scale-down step might fail. If it fails, you get an error message on the command line where you ran the script. After the failure, rerun it and see whether it progresses past the exact point where it failed before. Run it until progress between consecutive runs is unchanged or until it runs without any error messages.
- Tail the log for this operation in a separate terminal window for monitoring, future reference, or troubleshooting. ssh to e1n1 and run:
tail -f /var/log/appliance/platform/xcat/remove_ocs_node.log.tracelog
You can expect similar output, for example:
2022-10-19 10:05:55.221188 TRACE: 2022-10-19 10:05:56.221589 TRACE: Running command: oc rsh -n openshift-storage rook-ceph-tools-5d95856894-6mjlw ceph status 2022-10-19 10:05:56.965374 TRACE: Gathered ceph status: HEALTH_OK 2022-10-19 10:05:56.965871 TRACE: Running command: oc rsh -n openshift-storage rook-ceph-tools-5d95856894-6mjlw ceph health detail 2022-10-19 10:05:57.680450 TRACE: Running command: oc label node e2n4.fbond cluster.ocs.openshift.io/openshift-storage- 2022-10-19 10:05:57.850821 TRACE: Running command: oc label node e2n4.fbond topology.rook.io/rack- 2022-10-19 10:05:58.026213 TRACE: Running command: oc get nodes --show-labels 2022-10-19 10:05:58.188044 TRACE: Running command: oc get -n openshift-storage pods -l app=rook-ceph-operator 2022-10-19 10:05:58.356336 TRACE: Running command: oc get -n openshift-storage pods -l app=rook-ceph-operator 2022-10-19 10:05:58.518562 TRACE: Running command: oc rsh -n openshift-storage rook-ceph-tools-5d95856894-6mjlw ceph osd df 2022-10-19 10:06:04.229206 TRACE: Running command: oc get pdb -n openshift-storage 2022-10-19 10:06:04.378265 TRACE: ********** Node e2n4.fbond removed from OCS! **********
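After the scale-down completes, you can check how many nodes remain labeled for OCS storage; the cluster.ocs.openshift.io/openshift-storage label is the one that the removal script clears, as shown in the tracelog above. Only three storage nodes should remain:
# List the nodes that still carry the OCS storage label
oc get nodes -l cluster.ocs.openshift.io/openshift-storage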
- Automatic pod failover might not work in case of a node failure and pods might get stuck in Terminating state
- The automatic pod failover might not work in the case of an abrupt node failure. OpenShift notices that the node is not ready and starts evicting the pods, but they get stuck in Terminating state. For example:
[root@gt14-node1 ~]# oc get pod -n nps-1 NAME READY STATUS RESTARTS AGE bnr-san-storage-server-0 1/1 Terminating 0 3h12m ipshost-0 1/1 Terminating 0 13d nps-1-init-host-7jfx4 1/1 Running 0 13d nps-1-init-host-c4gt4 1/1 Running 0 13d
Workaround:
Run the oc delete command:
oc delete pod/<pod-name> --force --grace-period=0 -n <your-ns>
OR
Do a manual etcd database update by running:
$ oc project openshift-etcd
$ oc rsh -c etcdctl <etcd-pod-name>
sh-4.4# etcdctl del /kubernetes.io/pods/<namespace>/<pod-name>
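To find every pod that is stuck in Terminating state across namespaces before force-deleting them one by one, a quick check is:
# List pods reported as Terminating in all namespaces
oc get pods -A | grep Terminating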
- The 2.0.2.1 upgrade fails during the HpiPostinstaller.operatorpostinstallAbstractUpgrader phase due to unstable pods
- The 2.0.2.1 upgrade fails during the HpiPostinstaller.operatorpostinstallAbstractUpgrader phase due to unstable pods, with the following error:
1. HpiPostinstaller.operatorpostinstall Upgrade Detail: Component post install for hpi Caller Info:The call was made from 'HpiPostinstaller.do_operator_postinstall' on line 341 with file located at '/localrepo/2.0.2.1_release/EXTRACT/platform/upgrade/bundle_upgraders/../h Message: AbstractUpgrader.postinstaller:Encountered error in postinstall of HPI operator {}
Workaround:
- Run the following command:
oc get pods -n ap-hpid -o wide | grep ap-hpid
- Monitor the pods by running the oc command in step 1 until they are healthy.
Example of unhealthy pods:
ap-hpid-daemonset-49ct4 0/1 Terminating 0 2d19h 9.0.32.21 e2n2.fbond <none> <none> ap-hpid-daemonset-9lv5z 0/1 Terminating 0 2d19h 9.0.32.20 e2n1.fbond <none> <none> ap-hpid-daemonset-bbvl5 0/1 ContainerCreating 0 7s 9.0.32.23 e2n4.fbond <none> <none> ap-hpid-daemonset-bp9lv 0/1 Pending 0 7s <none> <none> e2n1.fbond <none> ap-hpid-daemonset-pkqcn 0/1 Pending 0 7s <none> <none> e2n2.fbond <none> ap-hpid-daemonset-rhsvw 0/1 Pending 0 7s <none> <none> e1n4.fbond <none> ap-hpid-daemonset-xhgv6 0/1 ContainerCreating 0 7s 9.0.32.22 e2n3.fbond <none> <none> ap-hpid-daemonset-z28lx 1/1 Terminating 0 2d19h 9.0.32.19 e1n4.fbond <none> <none>
- Restart the upgrade if every pod is operational and healthy.
Example of healthy pods:
ap-hpid-daemonset-5fbpf 1/1 Running 1 38h 9.0.32.22 e2n3.fbond <none> <none> ap-hpid-daemonset-8c2gw 1/1 Running 1 38h 9.0.32.23 e2n4.fbond <none> <none> ap-hpid-daemonset-db6h4 1/1 Running 0 38h 9.0.32.20 e2n1.fbond <none> <none> ap-hpid-daemonset-h7qmc 1/1 Running 1 38h 9.0.32.21 e2n2.fbond <none> <none> ap-hpid-daemonset-l4qlg 1/1 Running 2 38h 9.0.32.37 e5n1.fbond <none> <none> ap-hpid-daemonset-px7fq 1/1 Running 1 38h 9.0.32.28 e4n1.fbond <none> <none> ap-hpid-daemonset-qmmpc 1/1 Running 1 38h 9.0.32.19 e1n4.fbond <none> <none>
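The monitoring in step 2 of the workaround can be automated with watch; a minimal sketch:
# Refresh the pod status every 10 seconds until all ap-hpid pods are Running and 1/1
watch -n 10 "oc get pods -n ap-hpid -o wide | grep ap-hpid"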