Version 1.0.7.0 release notes
IBM® Cloud Pak for Data System version 1.0.7.0 introduces automated switch configuration and automated node expansion. It also includes the 1.0.5.2 hotfix and a few enhancements to the web consoles.
Upgrading
The upgrade procedure is described in Upgrading to version 1.0.7.0. Your system must be on 1.0.4.x or newer to upgrade.
What's new
- Network
  - Switch configuration automation
    The System_Name.yml file is extended to include the switch configuration for customer-facing ports. Note that the switches section is only available in the template after you upgrade the system, so you must add this section manually before the upgrade if you want to preserve the switch configuration. For more information, refer to the upgrade instructions.
  - yml file validation tool
    The aposYmlCheck.py utility is now available. It ensures that the yml file is valid and that all the required values (IP address, netmask, gateway) are of the proper type. For more information, see Validating the YAML file.
- Node expansion automated discovery
- When the system is expanded, new nodes are automatically discovered. You can then configure them as described in Adding new nodes to the system.
- NodeOS
- Custom hostname support
- Improved DHCP scaling
- Reduced OpenShift Container Platform control VM memory
- Platform management (already introduced in 1.0.5.2)
  - Better handling of Netezza host container migration from a heavily loaded node (high swap and CPU usage). Previously, the host was migrated from e1n1 to e1n2 and back to e1n1; now it is moved only once. Additionally, if the floating IP (the IP for connecting to the system) is moved from e1n1 to e1n2, it does not move back to e1n1 for as long as e1n2 remains healthy.
  - The Netezza host container waits for the floating IP and for confirmation that the other Netezza host containers from the previous node are STOPPED. The timeout for this wait is 5 minutes; when it passes, the container is started even if the floating IP is not up or the other Netezza host container is not STOPPED.
  - The storage utilization alert for /var/log/audit is no longer triggered at over 80% usage; by default, it is allowed to take more than 80%. After upgrading to 1.0.5.2, the old storage utilization alerts must be closed manually with ap issues --close <alert_id>, as shown in the sketch after this list.
  - The state of the Portworx cluster is now reported with ap sw, not only with ap sw -d.
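  For example, on a system already on 1.0.5.2 or later, checking the Portworx state and clearing a stale audit-storage alert might look as follows; the alert ID is a hypothetical placeholder:
    ap sw                    # the summary now includes the Portworx cluster state
    ap issues                # list open alerts and note the IDs of stale /var/log/audit alerts
    ap issues --close 1234   # close a stale alert; replace 1234 with the real alert ID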
- Web console
  - Storage options such as Portworx and hostpath for nodes running Db2 Warehouse
- Resource usage enhancements including metering information for software and hardware
- NPS® console new widgets
- Query Performance
- Queries summary for Period
- Queue monitor
- Allocation charts
- Swap space usage
- Resource monitors with spu and maxspu
- Security
- Removed the POSIX attributes requirement for Windows Active Directory users who integrate with Cloud Pak for Data System.
Components
- Red Hat Enterprise Linux 7.8
- Red Hat OpenShift Container Platform 3.11.188
- GPFS 4.2.3-22
- Cumulus operating system for fabric and management switches upgraded to version 3.7.11
Note: When you run the system bundle upgrade, the upgrade process attempts to quietly update the firmware and Cumulus OS on the fabric and management switches. If errors occur during that stage, a warning is displayed that the switches could not be updated. You can continue with the upgrade process to upgrade the remaining components. The system will be operational when you finish, but it is strongly advised that you contact IBM Support to assist in fixing the issues with the switch updates. The switches must be upgraded before you apply the next upgrade.
Fixed issues
- Fixed the issue with alerts being opened for Cloud Pak for Data not being ready while the system is still starting. A 45-minute wait time is now allowed for Cloud Pak for Data and OpenShift to start and become fully operational. If they do not start properly within this time, an alert is opened.
- Fixed an issue with Finisar transceivers not working correctly in nodes by implementing a workaround that forces the link speed and duplex settings. While this issue is fixed in Cloud Pak for Data System 1.0.7.0, it is still present on systems with Netezza® Performance Server for Cloud Pak for Data System version 11.0.5.0 and lower. Upgrade Netezza Performance Server for Cloud Pak for Data System to version 11.0.6.0 to avoid the issue. One way to verify the negotiated settings is shown in the sketch below.
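The negotiated link settings can be spot-checked with the standard ethtool utility; the interface name here is a placeholder for the port behind the Finisar transceiver on your system:
  ethtool eth0 | grep -E 'Speed|Duplex'   # eth0 is a placeholder interface name; expect the forced speed and duplex values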
Known issues
- OpenShift certificates might expire after a year
- You must run the annual certificate expiry check and ensure that the certificates are valid for the following year. For more information, see Running Red Hat OpenShift certificate expiry check. A quick manual spot check is shown below.
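Independently of the documented procedure, the expiry date of an individual certificate can be read with standard openssl; the path below is the typical master API server certificate location on OpenShift 3.11 and is an assumption for your deployment:
  openssl x509 -enddate -noout -in /etc/origin/master/master.server.crt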
- Upgrade breaks house network configuration for systems with NPS
- NPS systems with a 10G house network will have broken management network operations (DNS/LDAP/time/call home) following the upgrade. If you have Netezza installed, follow the steps described in the Netezza Performance Server section of the 1.0.7.0 upgrade procedure.
- Services bundle upgrade might fail reporting unhealthy pod
- If the services bundle upgrade fails and the upgrade log indicates that the failure is in Post-Upgrade Health Check:
  ERROR: Post-Upgrade Health Check: Openshift health check failed: Unhealthy pod exists in the cluster, please check the log /opt/ibm/appliance/storage/platform/appmgt/logs/appmgt.log for more detail.
  then confirm that the OpenShift upgrade is complete by the presence of these messages in /opt/ibm/appliance/storage/platform/appmgt/logs/appmgt.log:
  Ansible upgrade complete successfully. After this point, any failure may not indicate you have to rerun the upgrade again to fix it. Enter post_upgrade_check for rhos
  and that pod master-etcd-e1n1-1-control.fbond is unhealthy due to Image is not set:
  [root@e1n1-1-control ~]# oc get events -n kube-system | grep master-etcd-e1n1-1-control.fbond
  3m 3h 997 master-etcd-e1n1-1-control.fbond.162015ab43ee33a1 Pod Warning FailedSync kubelet, e1n1-1-control.fbond error determining status: Image is not set
  Workaround:
  See step 15 in the upgrade procedure.
  Note: To prevent this issue, run the following commands before you start the services upgrade:
  - To clean up the NotReady pods on the OpenShift nodes, run the following command from e1n1-1-control:
    nodes=$(oc get node --no-headers | awk '{print $1}')
    for n in $nodes; do ssh $n "pods=\$(crictl pods -q --state NotReady) && if [ -n \"\$pods\" ]; then crictl rmp \$pods; fi"; done
  - Run:
    oc get pod -n default
    oc get pod -n kube-system
- Upgrade fails with OpenShift health check error
- The upgrade process might terminate with an error similar to the following:
  Logfile: /var/log/appliance/apupgrade/20200705/apupgrade20200705085921.log
  Upgrade Command: apupgrade --upgrade --application all --upgrade-directory /localrepo --bundle services --use-version 1.0.7.0_release
  1. Rhos_ContainersUpgrader.preinstall
  Upgrade Detail: Component install for services
  Caller Info: The call was made from 'Rhos_ContainersPreInstaller.preinstall' on line 64 with file located at '/localrepo/1.0.7.0_release/EXTRACT/services/upgrade/bundle_upgraders/../services/rhos_containers_preinstaller.py'
  Message: {u'message': [u'rhos failed Health Check, check log for more detail'], u'statusCode': 400, u'result': u'Failure', u'error': {u'description': u'Openshift health check failed: Unhealthy pod exists in the cluster, please check the log /opt/ibm/appliance/storage/platform/appmgt/logs/appmgt.log for more detail.', u'errorName': u'OpenshiftHealthCheckError', u'statusCode': 400}}
  Workaround:
  - Log into e1n1-1-control.
  - Run the following command to verify that all pods show status READY 1/1:
    [root@e1n1-1-control ~]# oc get pod -n default
    NAME                      READY  STATUS   RESTARTS  AGE
    docker-registry-7-sn6zg   1/1    Running  0         9m
    registry-console-2-56qx7  1/1    Running  1         31d
    router-7-crnps            1/1    Running  0         14m
    router-7-dh95z            1/1    Running  0         15m
    router-7-s8lvm            1/1    Running  0         15m
    If any of the pods show ContainerCreating, wait a moment and rerun the command. You might need to run it a couple of times until all pods are in READY 1/1 state.
  - Log into e1n1 and restart the upgrade process with the following command:
    apupgrade --upgrade --application all --upgrade-directory /localrepo --bundle services --use-version 1.0.7.0_release
- Unable to add services with admin user in the web console
- Adding services such as Data Virtualization, or new projects for Python Jupyter Notebooks, must be executed with an administrative user other than the default admin user. The different uid values used in admin cause errors when adding services in the Cloud Pak for Data console. A fix for this issue is provided as step 16 in the upgrade procedure.
- Network configuration issues
  - Switch configuration not preserved during upgrade.
    Workaround:
    Add the switches section to the System_Name.yml file before the upgrade to 1.0.7.0 starts, as described in Upgrading to version 1.0.7.0.
  - Switch configuration automation breaks the Netezza Performance Server network setup.
    Version 1.0.7.0 introduces automation for application network switch settings. However, if Netezza Performance Server is installed on your system and you decide to modify your application network setup, do not run the switch automation step, as it will break the Netezza instance. You must configure the switches manually. If you ran the automation script by mistake, contact IBM Support to fix the configuration. IBM Support may reference this document (password required).
- Services installed with portworx-shared-gp might cause data loss
- If any Cloud Pak for Data services were deployed using the portworx-shared-gp storage class with single-replica storage volumes, this could cause data loss and raise an issue with Portworx being unable to disable a node, due to the single replication.
  Workaround:
  Run the following steps from a control VM (or a node with Portworx where pxctl commands can be run):
  - Increase the replica count from 1 to 2:
    for i in $(pxctl v l | awk '$5 ~ /^1$/ {print $1}'); do pxctl v u --repl 2 $i; done
  - Increase the replica count from 2 to 3:
    for i in $(pxctl v l | awk '$5 ~ /^2$/ {print $1}'); do pxctl v u --repl 3 $i; done
- apupgrade command output suggests wrong upgrade command syntax
- When running the apupgrade --preliminary-check command before the upgrade, the command output suggests the next command to run, but the syntax in the sample is wrong:
  Please review the release notes for this version at http://ibm.biz/icpds_rn prior to running the upgrade. Upgrade command: apupgrade --upgrade --use-version 1.0.7.0_release --upgrade-directory /localrepo --bundle services
  Follow the upgrade procedure as described in Upgrading to version 1.0.7.0 to avoid errors.
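  For reference, the documented form of the services upgrade command, as it appears in the upgrade log excerpts earlier in these notes, also includes the --application all option:
    apupgrade --upgrade --application all --upgrade-directory /localrepo --bundle services --use-version 1.0.7.0_release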
- Errors when running house network configuration check mode
- When running the following command in the Testing the YAML file and running playbooks procedure:
  ANSIBLE_HASH_BEHAVIOUR=merge ansible-playbook -i ./System_Name.yml playbooks/house_config.yml --check -v
  errors similar to the following might occur in any of these cases:
  - no DNS setup (initial installation)
  - DNS servers being changed
  - switch/IP networks being changed
  TASK [Attempt to reach FQDN from inside VM] ****
  fatal: [node1]: FAILED! => {"changed": true, "cmd": "ssh apphub.fbond \"ping -c 3 gt04-app.swg.usma.ibm.com\"", "delta": "0:00:00.466787", "end": "2020-07-09 21:51:33.311372", "msg": "non-zero return code", "rc": 2, "start": "2020-07-09 21:51:32.844585", "stderr": "\nping: gt04-app.swg.usma.ibm.com: Name or service not known", "stderr_lines": ["", "ping: gt04-app.swg.usma.ibm.com: Name or service not known"], "stdout": "", "stdout_lines": []}
  ...ignoring
  TASK [Check FQDN result] ****
  fatal: [node1]: FAILED! => {"changed": false, "msg": "Failed to reach FQDN gt04-app.swg.usma.ibm.com"}
  You can ignore these errors and continue with the procedure.
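  Once DNS is configured, you can rerun the failing check by hand to confirm that the FQDN is resolvable; this repeats the ping that the playbook issues from inside the VM (the hostnames are taken from the error output above and will differ on your system):
    ssh apphub.fbond "ping -c 3 gt04-app.swg.usma.ibm.com"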
- Upgrade fails with floating IP issues
- The following error message might be logged at the upgrade precheck step or later during the system upgrade:
  Unable to bring up floating ip 9.0.32.15/19 because of ERROR: Error running command [ifup 9.0.32.15/19]. RC: 1. STDOUT: [] STDERR: [/sbin/ifup: configuration for 9.0.32.15/19 not found. Usage: ifup
  Workaround:
  - Check for the hub node by verifying that the dhcpd service is running. Run the command:
    systemctl is-active dhcpd
    If the dhcpd service is running on a node other than e1n1, bring the service down on that node:
    systemctl stop dhcpd
    and start the service on e1n1:
    systemctl start dhcpd
  - On all control nodes other than e1n1, run the following two commands:
    ifdown fab-br0:Fl
    ifdown mgt-br0:Fl
  - On e1n1, run the following two commands:
    ifup fab-br0:Fl
    ifup mgt-br0:Fl
  - Check if the floating IPs (9.0.32.15/19 : fab-br0:Fl and 9.0.0.15/19 : mgt-br0:Fl) are up:
    [root@e1n1 ~]# ip addr show | grep 9.0.32.15/19
    inet 9.0.32.15/19 brd 9.0.63.255 scope global secondary fab-br0:99
    [root@e1n1 ~]# ip addr show | grep 9.0.0.15/19
    inet 9.0.0.15/19 brd 9.0.31.255 scope global secondary mgt-br0:99
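  If it is unclear which node currently owns the hub role, a quick loop over the control nodes shows where dhcpd is active; the node list assumes the standard e1n1 through e1n3 control nodes and may differ on your system:
    for n in e1n1 e1n2 e1n3; do echo -n "$n: "; ssh $n systemctl is-active dhcpd; done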
- The nzstart command returns errors such as nzstart: Error: insufficient tmpfs size
- If you are upgrading from version 1.0.4.x to 1.0.7.0, you might encounter the following error after running nzstart:
  nzstart: Error: insufficient tmpfs size
  In that case, redeploy the NPS container as described in Redeploying the container.
- appmgnt-rest service is stopped during upgrade
- If an error message similar to the following is seen in the apupgrade or appmgnt logs during the upgrade:
  appmgnt.utils.util: ERROR failed to check the status of appmgnt-rest {"error": {"description": "please make sure appmgnt-rest.service is running and try again.", "errorName": "RestServiceNotRunningError", "statusCode": 500}, "result": "Failure", "statusCode": 500}
  run the following command from e1n1 to verify that platform manager has deactivated the service:
  [root@e1n1 ~]# curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET
  {"status": 0, "upgrade_in_progress": "False"}
  Workaround:
  - Enable the platform manager upgrade mode on e1n1:
    [root@e1n1 ~]# curl -k https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X PUT -d '{"upgrade_in_progress": "True"}'
    {"status": 0, "message": "Application upgrade is enabled"}
  - Log in to e1n1-1-control and restart the appmgnt-rest service:
    systemctl restart appmgnt-rest
    systemctl status appmgnt-rest -l
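  To confirm that the upgrade mode is set, repeat the GET request shown above from e1n1; it should now report "True":
    curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET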
- systemd timeout issues
- The default log level for OpenShift is 2, which sometimes leads to timeouts on systemd. The symptom of this issue is alerts raised by platform manager, for example:
  2171 | 2020-10-24 19:34:50 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node2 | MAJOR | N/A |
  2172 | 2020-10-25 01:58:27 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node1 | MAJOR | N/A |
  In addition, the system might show the following symptoms:
  - OpenShift nodes not SSH-able; systemctl commands on VM nodes not responding
  - OpenShift nodes not ready for a long time and slow system response
  To recover a non-responding VM, run:
  virsh list --all (to get the VM name)
  virsh destroy <VM name> (to force stop the non-responding VM)
  virsh list --all (to verify that the VM shows Shutdown)
  virsh start <VM name> (to restart the VM)
  Ensure the OpenShift nodes are all in Ready state before proceeding with the workaround steps:
  oc get node -o wide
  for i in $(oc get node --no-headers | awk '{print $1}'); do ssh $i systemctl status atomic-openshift-node; done
  This can help get all the OpenShift nodes back to the Ready state and provides a clean environment before applying the workaround.
  Workaround:
  Run the following commands from any of the control nodes, for example, e1n1-1-control:
  - Change the log level in the master.env file on all masters:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/' /etc/origin/master/master.env"; done
  - Restart the controller on the master nodes:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "master-restart api && master-restart controllers"; done
  - Wait for the service to be up and running:
    - Run the following command from any of the control nodes:
      for i in $(oc get node --no-headers | grep master | awk '{print $1}'); do oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done
    - Make sure the output looks similar to the following example, where each pod is in Running state with 1/1 in the third column. You can run the command multiple times, or wait a couple of minutes, until all the respective pods are up and running. Example:
      [root@prunella1 ~]# for i in $(oc get node | grep master | awk '{print $1}'); do echo $i && oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done
      prunella1.fyre.ibm.com
      kube-system master-api-prunella1.fyre.ibm.com 1/1 Running 1 6d
      kube-system master-controllers-prunella1.fyre.ibm.com 1/1 Running 1 6d
      prunella2.fyre.ibm.com
      kube-system master-api-prunella2.fyre.ibm.com 1/1 Running 2 6d
      kube-system master-controllers-prunella2.fyre.ibm.com 1/1 Running 1 6d
      prunella3.fyre.ibm.com
      kube-system master-api-prunella3.fyre.ibm.com 1/1 Running 1 6d
      kube-system master-controllers-prunella3.fyre.ibm.com 1/1 Running 1 6d
  - Change the atomic-openshift-node log level:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/' /etc/sysconfig/atomic-openshift-node"; done
  - Restart the node services:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "systemctl restart atomic-openshift-node.service"; done
  - Make sure the node service is active on all nodes:
    for i in $(oc get node --no-headers | awk '{print $1}'); do ssh $i "systemctl status atomic-openshift-node.service | grep 'Active:'"; done
    Ensure that all Active: statuses are active (running) as in the sample output:
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:26 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
- CSR pending on the worker nodes
- The CSRs (Certificate Signing Requests) are generated to extend the node certificate validity after the default one-year validity passes. You can approve them manually with cluster administrator privileges, and enable the auto-approver to approve them automatically in the future. If CSRs are in pending status, the system might show the following symptoms:
  - Worker nodes in NotReady state, and pods may show NodeLost:
    [root@e2n1-1-worker ~]# oc get node
    NAME STATUS ROLES AGE VERSION
    e1n1-1-control.fbond Ready compute,infra,master 1y v1.11.0+d4cacc0
    e1n2-1-control.fbond Ready compute,infra,master 1y v1.11.0+d4cacc0
    e1n3-1-control.fbond Ready compute,infra,master 1y v1.11.0+d4cacc0
    e1n4-1-worker.fbond NotReady compute 1y v1.11.0+d4cacc0
    e2n1-1-worker.fbond NotReady compute 1y v1.11.0+d4cacc0
    e2n2-1-worker.fbond NotReady compute 1y v1.11.0+d4cacc0
  - atomic-openshift-node service failing to start on the worker nodes
  - Error messages in the atomic-openshift-node log on the problem node:
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: E0119 22:52:49.180917 19159 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2021-01-15 00:28:00 +0000 UTC
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: I0119 22:52:49.180947 19159 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
  - CSRs showing Pending condition when checking the status:
    [root@e1n1-1-control ~]# oc get csr
    NAME AGE REQUESTOR CONDITION
    csr-24sgr 17h system:admin Pending
    csr-25bvd 10h system:node:e1n3-1-control.fbond Pending
    csr-26c4l 4h system:node:e1n2-1-control.fbond Pending
    csr-2757q 4h system:node:e1n2-1-control.fbond Pending
    csr-27l4c 4h system:admin Pending
    csr-2j4h8 16h system:node:e1n2-1-control.fbond Pending
    csr-2knq7 5h system:node:e1n2-1-control.fbond Pending
    csr-2ltdq 16h system:admin Pending
    csr-2mvdd 5h system:node:e1n3-1-control.fbond Pending
    csr-2nf2k 8h system:node:e1n2-1-control.fbond Pending
    csr-2rtj7 6h system:admin Pending
  Workaround:
  - Run the certificate approval manually:
    oc get csr -o name | xargs oc adm certificate approve
    In oc get node, the nodes can recover to the Ready state after a few minutes.
  - Deploy the auto-approver by running:
    ansible-playbook -i /opt/ibm/appmgt/config/hosts /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml -vvvv -e openshift_master_bootstrap_auto_approve=true
  - Add openshift_master_bootstrap_auto_approve=true to /opt/ibm/appmgt/config/hosts:
    ...
    [OSEv3:vars]
    openshift_master_bootstrap_auto_approve=true
    ...
  - Run the annual certificate expiry check. For more information, see Running Red Hat OpenShift certificate expiry check.
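  After approving the CSRs, recovery can be verified with the same oc commands used in this section; pending requests should drain and the workers should return to Ready within a few minutes:
    oc get csr | grep -c Pending   # should eventually report 0
    oc get node                    # worker nodes should show Ready again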