Version 1.0.7.3 release notes
Cloud Pak for Data System version 1.0.7.3 includes security and monitoring enhancements, interactive visualizations in the web console, and a number of fixes.
Upgrading
- A new preliminary check process is now available with the --preliminary-check-with-fixes option. The process no longer only checks for possible issues; it also attempts to automatically fix any known issues during the pre-checks.
- The recommended location to stage the VM and Services bundles is GPFS storage. You must use /opt/ibm/appliance/storage/platform/localrepo/<version> instead of /localrepo/<version>. The recommended file path is provided in the upgrade procedure steps.
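For illustration, a preliminary check run might look like the following; combining the new option with the standard apupgrade options shown later in these notes is an assumption, so consult the upgrade procedure for the authoritative syntax and adjust the version and staging directory to your environment:
  apupgrade --preliminary-check-with-fixes --use-version 1.0.7.3_release --upgrade-directory /localrepo --bundle system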
What's new
- Stricter security settings for SSL connections
- Starting with version 1.0.7.3, the modern setting is used for ROUTER_CIPHERS by default. It provides an extremely high level of security. If you need to loosen the security settings, follow these steps:
  - Run the following command from e1n1-1-control:
    oc edit dc router -n default
  - Customize ROUTER_CIPHERS as needed, for example, change the value to intermediate, which was used in previous releases. For more information, see https://wiki.mozilla.org/Security/Server_Side_TLS.
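As a sketch of an alternative to editing the deployment configuration interactively, the same change can be made non-interactively with oc set env; this assumes the default router deployment configuration named router in the default namespace, as in the step above:
  oc set env dc/router ROUTER_CIPHERS=intermediate -n default
With the default configuration-change trigger on the deployment configuration, the router pods redeploy automatically after the change.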
- Log forwarding feature
- You can now forward the control node logs to a remote log server with the apsyslog command as described in Forwarding logs to a remote server.
- Serviceability
- Collecting Netezza related logs (specifically SPU logs) with apdiag now requires the --spus option. You can use the --spus option with a space-separated list of SPU names, or use --spus all to collect logs for all SPUs in the system. For example:
  apdiag collect --components ips/ --spus spu0101 spu0102
  For more information on apdiag, see apdiag command reference.
- Web console improvements
- New, interactive visualizations of hardware components and their actual location in the rack. You can zoom in and out of the rack view, filter the view by status or by component type, and view the details of a selected component.
Software components
- Cloud Pak for Data 3.0.1 System Edition
Read the following release notes to learn about new features introduced in version 3.0.1: Cloud Pak for Data 3.0.1 Release Notes.
- Portworx 2.5.0.1
- Netezza® Performance Server for Cloud Pak for Data System 11.0.6.1
Read the release notes at Netezza Performance Server 11.0.6.1 release notes.
Fixed issues
- Fixed the issue with portworx-api pods missing on control nodes.
- Fixed the issue with ap sw showing the wrong web console version.
- Fixed the OpenShift upgrade issues with route reconfiguration.
- Fixed the issue with services upgrade failing due to cluster certificate expiry errors.
- Fixed the user management issue with the admin user not being able to add services in the web console after the upgrade.
- Fixed the issue with the upgrade failing due to a timeout while running the appcli healthcheck --app=portworx command.
- Fixed the issues with the services upgrade failing with timeout errors.
- Fixed the Call Home page access issues for Asia-Pacific customers.
- Fixed the issue with the system not being accessible from a remote system using the external IP after the network configuration steps (alert 610: Floating IP interface is up, but inactive in network manager). Platform manager self-corrects the issue.
- Fixed the issue with multiple Unreachable or missing device for NSD events being opened.
Known issues
- Upgrade might fail if e1n1 is not the hub in Platform Manager
- Before you start the upgrade, ensure that e1n1 is the hub, as described in the upgrade procedure. Otherwise, several nodes might become inaccessible or might not respond in the expected timeframe after the nodes reboot, and the upgrade fails with an error similar to the following:
  2021-03-09 21:40:05 INFO: Some nodes were not available. [u'e15n1', u'e16n1', u'e1n4', u'e2n1', u'e2n2', u'e2n3', u'e2n4']
  2021-03-09 21:40:05 ERROR: Error running command [systemctl restart network] on [u'e2n4-fab', u'e2n3-fab', u'e15n1-fab', u'e2n1-fab', u'e16n1-fab', u'e1n4-fab', u'e2n2-fab']
  2021-03-09 21:40:05 ERROR: Unable to restart network services on [u'e15n1-fab', u'e16n1-fab', u'e1n4-fab', u'e2n1-fab', u'e2n2-fab', u'e2n3-fab', u'e2n4-fab']
  2021-03-09 21:40:05 ERROR: ERROR: Error running command [systemctl restart network] on [u'e2n4-fab', u'e2n3-fab', u'e15n1-fab', u'e2n1-fab', u'e16n1-fab', u'e1n4-fab', u'e2n2-fab']
  2021-03-09 21:40:05 ERROR:
  2021-03-09 21:40:05 ERROR: Unable to powercycle nodes via ipmitool.
  2021-03-09 21:40:05 ERROR: 'bmc_addr'
  2021-03-09 21:40:05 ERROR: The following nodes are still unavailable after a reboot attempt: [u'e15n1', u'e16n1', u'e1n4', u'e2n1', u'e2n2', u'e2n3', u'e2n4']
  2021-03-09 21:40:05 FATAL ERROR: Problem rebooting nodes
  To recover from the error:
  - Check for the hub node by verifying that the dhcpd service is running on the control nodes:
    systemctl is-active dhcpd
  - If the dhcpd service is not running, on e1n1, run:
    systemctl start dhcpd
  - Bring up the floating IPs:
    - On e1n2 and e1n3, run:
      /opt/ibm/appliance/platform/management/actions/master_disable.py -scope master
    - On e1n1, run:
      /opt/ibm/appliance/platform/management/actions/master_enable.py -scope master
  - Rerun the upgrade where it failed.
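A quick way to see which control node is currently acting as the hub is to check the dhcpd service on each of them, reusing the systemctl check from the recovery steps above; the node list is an assumption based on the control nodes named in this entry:
  for node in e1n1 e1n2 e1n3; do echo -n "$node: "; ssh $node systemctl is-active dhcpd; done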
- OpenShift certificates might expire after a year
- You must run the annual certificate expiry check and ensure that the certificates are valid for the following year. For more information, see Running Red Hat OpenShift certificate expiry check.
- System web console starts with the onboarding procedure after the upgrade
- Starting with version 1.0.7.3, the U-position setting in hardware configuration for enclosures and switches is mandatory. If your system does not have these fields set, the web console opens up with the onboarding procedure and you must fill in all the required fields to be able to proceed to the home page.
- Upgrade from 1.0.4.x with switch upgrade fails with GPFS fatal error but restarts
- The system bundle upgrade might fail at the GPFS upgrade with a fatal error in the log:
  58:34 FATAL ERROR: GpfsUpgrader.install : Fatal Problem: Could not copy files to all nodes.
  However, the nodes are rebooted and the upgrade restarts and continues. GPFS is upgraded and you can ignore the failure.
- Portworx volume mount fails with error signal: killed or exit status 32
- If an application pod fails to start and the Events section in the pod description shows a volume mount issue due to a signal: killed error, for example:
  [root@e1n1-1-control ~]# oc describe pod <pod-name> -n <pod-namespace> | grep -A 10 ^Events:
  Events:
  Type     Reason       Age               From                          Message
  ----     ------       ----              ----                          -------
  Warning  FailedMount  49s (x8 over 3m)  kubelet, e2n1-1-worker.fbond  MountVolume.SetUp failed for volume "pvc-9731b999-f85b-11ea-be79-003658063f81" : rpc error: code = Internal desc = Failed to mount volume pvc-9731b999-f85b-11ea-be79-003658063f81: signal: killed
  apply the workaround described in Troubleshooting issues with Portworx.
- False Portworx upgrade failed message in the upgrade log
- If the upgrade reports that the Portworx upgrade failed and you can see the following error:
  Running version not as expected: Failed to confirm PX upgrade version
  Portworx upgraded successfully but the version was not updated in time on one of the nodes.
  Workaround:
  - Verify that in the appmgt.log the Portworx upgrade is successful:
    2020-09-29 10:18:19,130 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:645)] - portworx_upgrade: all portworx pods in the cluster are up and running
    2020-09-29 10:18:19,132 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:646)] - portworx_upgrade: check_px_pods - Finish.
    2020-09-29 10:18:19,133 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:657)] - Portworx upgrade: confirm_px_upgrade_success - entering....
    2020-09-29 10:18:19,135 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.portworx_util.PortworxUtil(portworx_util.py:376)] - get_px_running_versions: start....
    2020-09-29 10:18:19,137 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:189)] - Executing: /bin/bash -c pxctl cluster list |sed '1,6d' | sed 's/[[:space:]]\+/#/g' |cut -d '#' -f 10
    2020-09-29 10:18:19,139 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:198)] - with input = None, env = None, dir = None
    2020-09-29 10:18:19,304 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:234)] - returncode = 0, raiseOnExitStatus = True
    2020-09-29 10:18:19,304 [DEBUG] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:243)] -
    2020-09-29 10:18:19,307 [DEBUG] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:244)] - 2.5.0.1-ff37efc 2.3.1.1-3fbe1bf 2.5.0.1-ff37efc 2.5.0.1-ff37efc 2.5.0.1-ff37efc 2.5.0.1-ff37efc
    2020-09-29 10:18:19,309 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.portworx_util.PortworxUtil(portworx_util.py:379)] - get_px_running_versions: ['2.5.0.1-ff37efc', '2.3.1.1-3fbe1bf', '2.5.0.1-ff37efc', '2.5.0.1-ff37efc', '2.5.0.1-ff37efc', '2.5.0.1-ff37efc']
    2020-09-29 10:18:19,310 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:661)] - Portworx has successfully upgraded to: 2.5.0.1-ff37efc
  - Run the following commands to verify that Portworx is upgraded and running the new version:
    pxctl status
    oc get pods -n kube-system
  - Resume the upgrade process.
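To see which node still reports the old Portworx version, you can run the same pxctl cluster list pipeline that appmgnt uses in the log above directly on a control node; the pipeline is copied from that log, and the column position is specific to this Portworx release:
  pxctl cluster list | sed '1,6d' | sed 's/[[:space:]]\+/#/g' | cut -d '#' -f 10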
- Switch configuration automation breaks the Netezza Performance Server network setup
- Version 1.0.7.0 introduces automation for application network switch settings. However, if Netezza Performance Server is installed on your system, and you decide to modify your application network setup, do not run the switch automation step, as it will break the Netezza instance. You must configure the switches manually as described in Deprecated: Manually configuring fabric network switches.
- Upgrade breaks house network configuration for systems with NPS
- On NPS systems with a 10G house network, management network operations (dns/ldap/time/callhome) are broken after the upgrade. If you have Netezza installed, follow the post-upgrade steps described in Netezza Performance Server post-upgrade steps to fix the network setup.
- Upgrade from versions earlier than 1.0.7.x with NPS installed is failing
- When NPS is installed on a system running a version earlier than 1.0.7.x, apupgrade might fail due to an unknown MAC address for the fabric switch.
  Workaround:
  - Rerun apupgrade with the --skip-firmware and --skip-hw-cfg options:
    apupgrade --upgrade --use-version 1.0.7.3_release --upgrade-directory /localrepo --bundle system --skip-firmware --skip-hw-cfg
  - Continue with the upgrade.
  - Once the upgrade finishes successfully, rerun the system bundle upgrade to update the firmware on the system:
    apupgrade --upgrade --use-version 1.0.7.3_release --upgrade-directory /localrepo --bundle system
- Cloud Pak for Data must be installed on Zen namespace
- Cloud Pak for Data must always be installed in the Zen namespace. Otherwise, the upgrade of this component fails, and the VM must be redeployed, which means all data is lost.
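To confirm that Cloud Pak for Data is deployed in the expected namespace before you upgrade, a quick check is to list its project and pods; this is a sketch that assumes the oc session on a control node used elsewhere in these notes:
  oc get project zen
  oc get pods -n zen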
- Upgrade precheck fails with IP conflict error
- If the upgrade precheck fails with an error similar to the following:
  ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (40:F2:E9:2F:99:57) already uses address 10.8.9.250.
  verify whether ASA/ARP settings are enabled. The Cisco device configuration needs to be fixed so that it does not respond to ARP requests that are not designated to it.
- apupgrade precheck failing because of missing files
- In some cases, the yosemite-kube-1.0.7.3-SNAPSHOT-x86_64.tar.gz file might be missing from the /opt/ibm/appliance/storage/platform/upgrade/icp4d_console/ and /install/app_img/icp4d_console locations.
  Workaround:
  Copy the missing files from the existing upgrade bundle to the following locations:
  - /opt/ibm/appliance/storage/platform/upgrade/icp4d_console/
- /install/app_img/icp4d_console
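A minimal sketch of the copy, assuming the upgrade bundle has already been extracted; <path-to-extracted-bundle> is a placeholder for wherever the bundle contents are staged on your system:
  cp <path-to-extracted-bundle>/yosemite-kube-1.0.7.3-SNAPSHOT-x86_64.tar.gz /opt/ibm/appliance/storage/platform/upgrade/icp4d_console/
  cp <path-to-extracted-bundle>/yosemite-kube-1.0.7.3-SNAPSHOT-x86_64.tar.gz /install/app_img/icp4d_console/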
- Service bundle upgrade failing due to a read timeout
- If an error similar to the following is logged during the upgrade:
  2020-10-02 09:24:44 TRACE: RC: 255. STDOUT: [{"error": {"description": "Unexpected error happen, failed to get response: HTTPSConnectionPool(host='127.0.0.1', port=9443): Read timed out. (read timeout=600)", "errorName": "HttpsRequestError", "statusCode": 500}, "result": "Failure", "statusCode": 500} ] STDERR: [ ]
  the system should come online after some time. Ignore the error and continue with the upgrade.
- The Save button for the Call Home configuration in the web console is inactive
- You must first enable either Problem Reporting or Status Reporting in the Control tab of the Call Home settings before you can save any further configuration settings for Call Home and Notifications (Alert Management).
- appmgnt-rest service is stopped during upgrade
- If an error message similar to the following is seen in the apupgrade or appmgnt logs during the upgrade:
  appmgnt.utils.util: ERROR failed to check the status of appmgnt-rest {"error": {"description": "please make sure appmgnt-rest.service is running and try again.", "errorName": "RestServiceNotRunningError", "statusCode": 500}, "result": "Failure", "statusCode": 500}
  run the following command from e1n1 to verify that the platform manager has deactivated the service:
  curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET
  For example:
  [root@e1n1 ~]# curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET
  {"status": 0, "upgrade_in_progress": "False"}
  Workaround:
  - Enable the platform manager upgrade mode on e1n1:
    curl -k https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X PUT -d '{"upgrade_in_progress": "True"}'
    For example:
    [root@e1n1 ~]# curl -k https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X PUT -d '{"upgrade_in_progress": "True"}'
    {"status": 0, "message": "Application upgrade is enabled"}
  - Log in to e1n1-1-control and restart the appmgnt-rest service:
    systemctl restart appmgnt-rest
    systemctl status appmgnt-rest -l
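After the restart, you can confirm that the service responds and that the upgrade mode is still enabled by repeating the status query used above; the response should now report "upgrade_in_progress": "True":
  curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET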
- systemd timeout issues
- The default log level for OpenShift is 2, which sometimes leads to timeouts on systemd. The symptom for this issue is alerts raised by the platform manager, for example:
  2171 | 2020-10-24 19:34:50 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node2 | MAJOR | N/A |
  2172 | 2020-10-25 01:58:27 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node1 | MAJOR | N/A |
  In addition, the system might show the following symptoms:
  - OpenShift nodes not SSH-able; systemctl commands on VM nodes not responding
  - OpenShift nodes not ready for a long time and slow system response
  If a VM node is not responding, use virsh list --all to get the VM name, virsh destroy <VM name> to force stop the non-responding VM, virsh list --all to verify that the VM shows Shutdown, and virsh start <VM name> to restart the VM.
  Ensure that the OpenShift nodes are all in Ready state before proceeding with the workaround steps:
  oc get node -o wide
  for i in $(oc get node --no-headers | awk '{print $1}'); do ssh $i systemctl status atomic-openshift-node; done
  This can help get all the OpenShift nodes back to the Ready state, and provide a clean environment before applying the workaround.
  Workaround:
  Run the following commands from any of the control nodes, for example, e1n1-1-control:
  - Change the log level in the master.env file on all masters:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/' /etc/origin/master/master.env"; done
  - Restart the controller on the master nodes:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "master-restart api && master-restart controllers"; done
  - Wait for the services to be up and running:
    - Run the following command from any of the control nodes:
      for i in $(oc get node --no-headers | grep master | awk '{print $1}'); do oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done
    - Make sure the output of the above command looks similar to the following example, where all pods are in Running state with 1/1 in the third column of each line. You can run the command multiple times, or wait for a couple of minutes for all the respective pods to come up. Example:
      [root@prunella1 ~]# for i in $(oc get node | grep master | awk '{print $1}'); do echo $i && oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done
      prunella1.fyre.ibm.com
      kube-system   master-api-prunella1.fyre.ibm.com           1/1   Running   1   6d
      kube-system   master-controllers-prunella1.fyre.ibm.com   1/1   Running   1   6d
      prunella2.fyre.ibm.com
      kube-system   master-api-prunella2.fyre.ibm.com           1/1   Running   2   6d
      kube-system   master-controllers-prunella2.fyre.ibm.com   1/1   Running   1   6d
      prunella3.fyre.ibm.com
      kube-system   master-api-prunella3.fyre.ibm.com           1/1   Running   1   6d
      kube-system   master-controllers-prunella3.fyre.ibm.com   1/1   Running   1   6d
  - Change the atomic OpenShift node log level:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/' /etc/sysconfig/atomic-openshift-node"; done
  - Restart the node services:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "systemctl restart atomic-openshift-node.service"; done
  - Make sure the node service is active on all nodes:
    for i in $(oc get node --no-headers | awk '{print $1}'); do ssh $i "systemctl status atomic-openshift-node.service | grep 'Active:'"; done
    Ensure that all Active: statuses are active (running) as in the sample output:
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:26 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
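The virsh recovery sequence mentioned above, gathered into one sketch; run it on the node that hosts the non-responding VM, and treat <VM name> as a placeholder for the name that virsh list --all reports:
  virsh list --all          # get the VM name and its current state
  virsh destroy <VM name>   # force stop the non-responding VM
  virsh list --all          # verify that the VM shows Shutdown
  virsh start <VM name>     # restart the VM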
- Services upgrade summary reports blank values for version numbers
- Once the services upgrade completes successfully, the final upgrade results report empty brackets instead of version numbers, as in the following sample:
  Upgrade results
  services : successfully upgraded from {} to {} services
  Upgrade time : 3:16:47
  All upgrade steps complete.
  Broadcast message from root@e1n1 (Thu Sep 24 21:10:40 2020):
  A running upgrade completed successfully.
  Broadcast message from root@e1n1 (Thu Sep 24 21:10:40 2020):
  See details at /var/log/appliance/apupgrade/20200924/apupgrade20200924173509.log for further information
  This is a known issue.
- Upgrade to 1.0.7.3 leaves a db2wh pod in unhealthy state
- If after the upgrade one of the db2wh pods is in an unhealthy state, as in the example:
  [root@e1n1-1-control ~]# oc get pods --all-namespaces=True | egrep -vi "Running|Complete"
  NAMESPACE   NAME                                            READY   STATUS     RESTARTS   AGE
  zen         db2wh-1600568372108-db2u-ldap-b9f64d4f4-2n2mm   0/1     Init:0/1   0          7h
  run the following steps to fix it:
  - Run the following command to check the status of the pods:
    oc get pods --all-namespaces=True | egrep -vi "Running|Complete"
    Make sure to give the pod some time to come up after the upgrade. Sometimes it might take a bit longer, but not hours.
  - Run the following command:
    oc describe pod <pod-name> -n zen
    For example:
    oc describe pod db2wh-1600568372108-db2u-ldap-b9f64d4f4-2n2mm -n zen
    The following output is shown:
    Warning  Unhealthy  3m               kubelet, e2n3-1-db2wh.fbond  Liveness probe failed:
    Warning  Unhealthy  2m (x2 over 3m)  kubelet, e2n3-1-db2wh.fbond  Readiness probe failed:
  - Delete the pod to recover:
    oc delete pod db2wh-1600568372108-db2u-ldap-b9f64d4f4-2n2mm -n zen
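After deleting the pod, you can check that the replacement pod starts; the filter below reuses the pod name pattern from the example above, which will differ on your system. Re-run the command until the pod reports Running with 1/1 READY:
  oc get pods -n zen | grep db2u-ldap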
- CSR pending on the worker nodes
- The CSRs (Certificate Signing Requests) are generated to extend the node certificate validity after the default one-year validity passes. You can approve them manually if you have cluster administrator privileges, and enable the auto-approver to approve them automatically in the future. If CSRs are in pending status, the system might show the following symptoms:
  - Worker nodes in NotReady state and pods possibly showing NodeLost:
    [root@e2n1-1-worker ~]# oc get node
    NAME                   STATUS     ROLES                  AGE   VERSION
    e1n1-1-control.fbond   Ready      compute,infra,master   1y    v1.11.0+d4cacc0
    e1n2-1-control.fbond   Ready      compute,infra,master   1y    v1.11.0+d4cacc0
    e1n3-1-control.fbond   Ready      compute,infra,master   1y    v1.11.0+d4cacc0
    e1n4-1-worker.fbond    NotReady   compute                1y    v1.11.0+d4cacc0
    e2n1-1-worker.fbond    NotReady   compute                1y    v1.11.0+d4cacc0
    e2n2-1-worker.fbond    NotReady   compute                1y    v1.11.0+d4cacc0
  - atomic-openshift-node service failure to start on the worker nodes
  - Error message in the atomic-openshift-node log from the problem node:
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: E0119 22:52:49.180917 19159 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2021-01-15 00:28:00 +0000 UTC
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: I0119 22:52:49.180947 19159 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
  - CSRs showing Pending condition when you check the status:
    [root@e1n1-1-control ~]# oc get csr
    NAME        AGE   REQUESTOR                          CONDITION
    csr-24sgr   17h   system:admin                       Pending
    csr-25bvd   10h   system:node:e1n3-1-control.fbond   Pending
    csr-26c4l   4h    system:node:e1n2-1-control.fbond   Pending
    csr-2757q   4h    system:node:e1n2-1-control.fbond   Pending
    csr-27l4c   4h    system:admin                       Pending
    csr-2j4h8   16h   system:node:e1n2-1-control.fbond   Pending
    csr-2knq7   5h    system:node:e1n2-1-control.fbond   Pending
    csr-2ltdq   16h   system:admin                       Pending
    csr-2mvdd   5h    system:node:e1n3-1-control.fbond   Pending
    csr-2nf2k   8h    system:node:e1n2-1-control.fbond   Pending
    csr-2rtj7   6h    system:admin                       Pending
  Workaround:
  - Run the certificate approval manually:
    oc get csr -o name | xargs oc adm certificate approve
    In oc get node, the nodes can recover to the Ready state after a few minutes.
  - Deploy the auto-approver by running:
    ansible-playbook -i /opt/ibm/appmgt/config/hosts /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml -vvvv -e openshift_master_bootstrap_auto_approve=true
  - Add openshift_master_bootstrap_auto_approve=true to /opt/ibm/appmgt/config/hosts:
    ...
    [OSEv3:vars]
    openshift_master_bootstrap_auto_approve=true
    ...
  - Run the annual certificate expiry check. For more information, see Running Red Hat OpenShift certificate expiry check.
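After the manual approval, a quick way to confirm that nothing is left pending is to filter the CSR list again, reusing the oc get csr command from above:
  oc get csr | grep Pending || echo "No pending CSRs"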