Version 1.0.7.0 release notes
IBM® Cloud Pak for Data System version 1.0.7.0 introduces automated switch configuration and automated node expansion. It also includes the 1.0.5.2 hotfix and a few enhancements to the web consoles.
Upgrading
The upgrade procedure is described in Upgrading to version 1.0.7.0. Your system must be on 1.0.4.x or newer to upgrade.
What's new
- Network
  - Switch configuration automation
    The System_Name.yml file is extended to include the switch configuration for customer-facing ports. Note that the switches section is only available in the template after you upgrade the system, so you must add this section manually before the upgrade if you want to preserve the switch configuration. For more information, refer to the upgrade instructions.
  - yml file validation tool
    The aposYmlCheck.py utility is now available. It ensures that the yml file is valid and that all the required values (IP address, netmask, gateway) are of the proper type. For more information, see Validating the YAML file.
- Node expansion automated discovery
- When the system is expanded, new nodes are automatically discovered. You can then configure them as described in Adding new nodes to the system.
- NodeOS
- Custom hostname support
- Improved DHCP scaling
- Reduced OpenShift Container Platform control VM memory
- Platform management (already introduced in 1.0.5.2)
  - Better handling of Netezza host container migration from a heavily loaded node (high swap and CPU usage). Previously, the host was migrated from e1n1 to e1n2 and back to e1n1; now it is moved only once. Additionally, if the floating IP (the IP for connecting to the system) is moved from e1n1 to e1n2, it does not move back to e1n1 for as long as e1n2 remains healthy.
  - The Netezza host container waits for the floating IP and for confirmation that the other Netezza host containers from the previous node are STOPPED. The timeout for this wait is 5 minutes; when it passes, the container is started even if the floating IP is not up or the other Netezza host container is not STOPPED.
  - The storage utilization alert for /var/log/audit is no longer triggered at over 80% usage; by default, it is allowed to take more than 80%. After upgrading to 1.0.5.2, the old storage utilization alerts must be closed manually with ap issues --close <alert_id>, as shown in the sketch after this list.
  - The state of the Portworx cluster is now reported with ap sw, not only with ap sw -d.
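  For example, on a system already on 1.0.5.2 or later, checking the Portworx state and clearing a stale audit-storage alert might look as follows; the alert ID is a hypothetical placeholder:
    ap sw                    # the summary now includes the Portworx cluster state
    ap issues                # list open alerts and note the IDs of stale /var/log/audit alerts
    ap issues --close 1234   # close a stale alert; replace 1234 with the real alert ID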
- Web console
  - Storage options such as Portworx and hostpath for nodes running Db2 Warehouse
- Resource usage enhancements including metering information for software and hardware
- NPS® console new widgets
- Query Performance
- Queries summary for Period
- Queue monitor
- Allocation charts
- Swap space usage
- Resource monitors with spu and maxspu
- Security
- Removed the POSIX attributes requirement for Windows Active Directory users who integrate with Cloud Pak for Data System.
Components
- Red Hat Enterprise Linux 7.8
- Red Hat OpenShift Container Platform 3.11.188
- GPFS 4.2.3-22
- Cumulus operating system for fabric and management switches upgraded to version 3.7.11
Note: When you run the system bundle upgrade, the upgrade process attempts to quietly update the firmware and Cumulus OS on the fabric and management switches. If errors occur during that stage, a warning is displayed that the switches could not be updated. You can continue with the upgrade process to upgrade the remaining components. The system will be operational when you finish, but it is strongly advised that you contact IBM Support to assist in fixing the issues with the switch updates. The switches must be upgraded before you apply the next upgrade.
Fixed issues
- Fixed the issue with alerts being opened for Cloud Pak for Data not being ready while the system is still starting. A 45-minute wait time is now allowed for Cloud Pak for Data and OpenShift to start and become fully operational. If they do not start properly within this time, an alert is opened.
- Fixed an issue with Finisar transceivers not working correctly in nodes by implementing a workaround that forces the link speed and duplex settings. While this issue is fixed in Cloud Pak for Data System 1.0.7.0, it is still present on systems with Netezza® Performance Server for Cloud Pak for Data System version 11.0.5.0 and lower. Upgrade Netezza Performance Server for Cloud Pak for Data System to version 11.0.6.0 to avoid the issue. One way to verify the negotiated settings is shown in the sketch below.
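The negotiated link settings can be spot-checked with the standard ethtool utility; the interface name here is a placeholder for the port behind the Finisar transceiver on your system:
  ethtool eth0 | grep -E 'Speed|Duplex'   # eth0 is a placeholder interface name; expect the forced speed and duplex values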
Known issues
- OpenShift certificates might expire after a year
- You must run the annual certificate expiry check and ensure that the certificates are valid for the following year. For more information, see Running Red Hat OpenShift certificate expiry check. A quick manual spot check is shown below.
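Independently of the documented procedure, the expiry date of an individual certificate can be read with standard openssl; the path below is the typical master API server certificate location on OpenShift 3.11 and is an assumption for your deployment:
  openssl x509 -enddate -noout -in /etc/origin/master/master.server.crt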
- Upgrade breaks house network configuration for systems with NPS
- NPS systems with a 10G house network will have broken management network operations (DNS/LDAP/time/call home) following the upgrade. If you have Netezza installed, follow the steps described in the Netezza Performance Server section of the 1.0.7.0 upgrade procedure.
- Services bundle upgrade might fail reporting unhealthy pod
- If the services bundle upgrade fails and the upgrade log indicates that the failure is in Post-Upgrade Health Check:
  ERROR: Post-Upgrade Health Check: Openshift health check failed: Unhealthy pod exists in the cluster, please check the log /opt/ibm/appliance/storage/platform/appmgt/logs/appmgt.log for more detail.
  then confirm that the OpenShift upgrade is complete by the presence of these messages in /opt/ibm/appliance/storage/platform/appmgt/logs/appmgt.log:
  Ansible upgrade complete successfully. After this point, any failure may not indicate you have to rerun the upgrade again to fix it. Enter post_upgrade_check for rhos
  and that pod master-etcd-e1n1-1-control.fbond is unhealthy due to Image is not set:
  [root@e1n1-1-control ~]# oc get events -n kube-system | grep master-etcd-e1n1-1-control.fbond
  3m 3h 997 master-etcd-e1n1-1-control.fbond.162015ab43ee33a1 Pod Warning FailedSync kubelet, e1n1-1-control.fbond error determining status: Image is not set
  Workaround:
  See step 15 in the upgrade procedure.
  Note: To prevent this issue, run the following commands before you start the services upgrade:
  - To clean up the NotReady pods on the OpenShift nodes, run the following command from e1n1-1-control:
    nodes=$(oc get node --no-headers | awk '{print $1}')
    for n in $nodes; do ssh $n "pods=\$(crictl pods -q --state NotReady) && if [ -n \"\$pods\" ]; then crictl rmp \$pods; fi"; done
  - Run:
    oc get pod -n default
    oc get pod -n kube-system
- Upgrade fails with OpenShift health check error
- The upgrade process might terminate with an error similar to the following:
  Logfile: /var/log/appliance/apupgrade/20200705/apupgrade20200705085921.log
  Upgrade Command: apupgrade --upgrade --application all --upgrade-directory /localrepo --bundle services --use-version 1.0.7.0_release
  1. Rhos_ContainersUpgrader.preinstall
  Upgrade Detail: Component install for services
  Caller Info: The call was made from 'Rhos_ContainersPreInstaller.preinstall' on line 64 with file located at '/localrepo/1.0.7.0_release/EXTRACT/services/upgrade/bundle_upgraders/../services/rhos_containers_preinstaller.py'
  Message: {u'message': [u'rhos failed Health Check, check log for more detail'], u'statusCode': 400, u'result': u'Failure', u'error': {u'description': u'Openshift health check failed: Unhealthy pod exists in the cluster, please check the log /opt/ibm/appliance/storage/platform/appmgt/logs/appmgt.log for more detail.', u'errorName': u'OpenshiftHealthCheckError', u'statusCode': 400}}
  Workaround:
  - Log into e1n1-1-control.
  - Run the following command to verify that all pods show status READY 1/1:
    [root@e1n1-1-control ~]# oc get pod -n default
    NAME                      READY  STATUS   RESTARTS  AGE
    docker-registry-7-sn6zg   1/1    Running  0         9m
    registry-console-2-56qx7  1/1    Running  1         31d
    router-7-crnps            1/1    Running  0         14m
    router-7-dh95z            1/1    Running  0         15m
    router-7-s8lvm            1/1    Running  0         15m
    If any of the pods show ContainerCreating, wait a moment and rerun the command. You might need to run it a couple of times until all pods are in READY 1/1 state.
  - Log into e1n1 and restart the upgrade process with the following command:
    apupgrade --upgrade --application all --upgrade-directory /localrepo --bundle services --use-version 1.0.7.0_release
- Unable to add services with admin user in the web console
- Adding services such as Data Virtualization, or new projects for Python Jupyter Notebooks, must be executed with an administrative user other than the default admin user. The different uid values used in admin cause errors when adding services in the Cloud Pak for Data console. A fix for this issue is provided as step 16 in the upgrade procedure.
- Network configuration issues
  - Switch configuration not preserved during upgrade.
    Workaround:
    Add the switches section to the System_Name.yml file before the upgrade to 1.0.7.0 starts, as described in Upgrading to version 1.0.7.0.
  - Switch configuration automation breaks the Netezza Performance Server network setup.
    Version 1.0.7.0 introduces automation for application network switch settings. However, if Netezza Performance Server is installed on your system and you decide to modify your application network setup, do not run the switch automation step, as it will break the Netezza instance. You must configure the switches manually. If you ran the automation script by mistake, contact IBM Support to fix the configuration. IBM Support may reference this document (password required).
- Services installed with portworx-shared-gp might cause data loss
- If any Cloud Pak for Data services were deployed using the portworx-shared-gp storage class with single-replica storage volumes, this could cause data loss and raise an issue with Portworx being unable to disable a node, due to the single replication.
  Workaround:
  Run the following steps from a control VM (or a node with Portworx where pxctl commands can be run):
  - Increase the replica count from 1 to 2:
    for i in $(pxctl v l | awk '$5 ~ /^1$/ {print $1}'); do pxctl v u --repl 2 $i; done
  - Increase the replica count from 2 to 3:
    for i in $(pxctl v l | awk '$5 ~ /^2$/ {print $1}'); do pxctl v u --repl 3 $i; done
- apupgrade command output suggests wrong upgrade command syntax
- When running the apupgrade --preliminary-check command before the upgrade, the command output suggests the next command to run, but the syntax in the sample is wrong:
  Please review the release notes for this version at http://ibm.biz/icpds_rn prior to running the upgrade. Upgrade command: apupgrade --upgrade --use-version 1.0.7.0_release --upgrade-directory /localrepo --bundle services
  Follow the upgrade procedure as described in Upgrading to version 1.0.7.0 to avoid errors.
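  For reference, the documented form of the services upgrade command, as it appears in the upgrade log excerpts earlier in these notes, also includes the --application all option:
    apupgrade --upgrade --application all --upgrade-directory /localrepo --bundle services --use-version 1.0.7.0_release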
- Errors when running house network configuration check mode
- When running the following command in the Testing the YAML file and running playbooks procedure:
  ANSIBLE_HASH_BEHAVIOUR=merge ansible-playbook -i ./System_Name.yml playbooks/house_config.yml --check -v
  errors similar to the following might occur in any of these cases:
  - no DNS setup (initial installation)
  - DNS servers being changed
  - switch/IP networks being changed
  TASK [Attempt to reach FQDN from inside VM] ****
  fatal: [node1]: FAILED! => {"changed": true, "cmd": "ssh apphub.fbond \"ping -c 3 gt04-app.swg.usma.ibm.com\"", "delta": "0:00:00.466787", "end": "2020-07-09 21:51:33.311372", "msg": "non-zero return code", "rc": 2, "start": "2020-07-09 21:51:32.844585", "stderr": "\nping: gt04-app.swg.usma.ibm.com: Name or service not known", "stderr_lines": ["", "ping: gt04-app.swg.usma.ibm.com: Name or service not known"], "stdout": "", "stdout_lines": []}
  ...ignoring
  TASK [Check FQDN result] ****
  fatal: [node1]: FAILED! => {"changed": false, "msg": "Failed to reach FQDN gt04-app.swg.usma.ibm.com"}
  You can ignore these errors and continue with the procedure.
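  Once DNS is configured, you can rerun the failing check by hand to confirm that the FQDN is resolvable; this repeats the ping that the playbook issues from inside the VM (the hostnames are taken from the error output above and will differ on your system):
    ssh apphub.fbond "ping -c 3 gt04-app.swg.usma.ibm.com"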
- Upgrade fails with floating IP issues
- The following error message might be logged at the upgrade precheck step or later during the system upgrade:
  Unable to bring up floating ip 9.0.32.15/19 because of ERROR: Error running command [ifup 9.0.32.15/19]. RC: 1. STDOUT: [] STDERR: [/sbin/ifup: configuration for 9.0.32.15/19 not found. Usage: ifup
  Workaround:
  - Check for the hub node by verifying that the dhcpd service is running. Run the command:
    systemctl is-active dhcpd
    If the dhcpd service is running on a node other than e1n1, bring the service down on that node:
    systemctl stop dhcpd
    and start the service on e1n1:
    systemctl start dhcpd
  - On all control nodes other than e1n1, run the following two commands:
    ifdown fab-br0:Fl
    ifdown mgt-br0:Fl
  - On e1n1, run the following two commands:
    ifup fab-br0:Fl
    ifup mgt-br0:Fl
  - Check if the floating IPs (9.0.32.15/19 : fab-br0:Fl and 9.0.0.15/19 : mgt-br0:Fl) are up:
    [root@e1n1 ~]# ip addr show | grep 9.0.32.15/19
    inet 9.0.32.15/19 brd 9.0.63.255 scope global secondary fab-br0:99
    [root@e1n1 ~]# ip addr show | grep 9.0.0.15/19
    inet 9.0.0.15/19 brd 9.0.31.255 scope global secondary mgt-br0:99
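  If it is unclear which node currently owns the hub role, a quick loop over the control nodes shows where dhcpd is active; the node list assumes the standard e1n1 through e1n3 control nodes and may differ on your system:
    for n in e1n1 e1n2 e1n3; do echo -n "$n: "; ssh $n systemctl is-active dhcpd; done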
- The nzstart command returns errors such as nzstart: Error: insufficient tmpfs size
- If you are upgrading from version 1.0.4.x to 1.0.7.0, you might encounter the following error after running nzstart:
  nzstart: Error: insufficient tmpfs size
  In that case, redeploy the NPS container as described in Redeploying the container.
- appmgnt-rest service is stopped during upgrade
- If an error message similar to the following is seen in the apupgrade or appmgnt logs during the upgrade:
  appmgnt.utils.util: ERROR failed to check the status of appmgnt-rest {"error": {"description": "please make sure appmgnt-rest.service is running and try again.", "errorName": "RestServiceNotRunningError", "statusCode": 500}, "result": "Failure", "statusCode": 500}
  run the following command from e1n1 to verify that platform manager has deactivated the service:
  [root@e1n1 ~]# curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET
  {"status": 0, "upgrade_in_progress": "False"}
  Workaround:
  - Enable the platform manager upgrade mode on e1n1:
    [root@e1n1 ~]# curl -k https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X PUT -d '{"upgrade_in_progress": "True"}'
    {"status": 0, "message": "Application upgrade is enabled"}
  - Log in to e1n1-1-control and restart the appmgnt-rest service:
    systemctl restart appmgnt-rest
    systemctl status appmgnt-rest -l
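  To confirm that the upgrade mode is set, repeat the GET request shown above from e1n1; it should now report "True":
    curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET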
- systemd timeout issues
- The default log level for OpenShift is 2, which sometimes leads to timeouts on systemd. The symptom of this issue is alerts raised by platform manager, for example:
  2171 | 2020-10-24 19:34:50 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node2 | MAJOR | N/A |
  2172 | 2020-10-25 01:58:27 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node1 | MAJOR | N/A |
  In addition, the system might show the following symptoms:
  - OpenShift nodes not SSH-able; systemctl commands on VM nodes not responding
  - OpenShift nodes not ready for a long time and slow system response
  To recover a non-responding VM, run:
  virsh list --all (to get the VM name)
  virsh destroy <VM name> (to force stop the non-responding VM)
  virsh list --all (to verify that the VM shows Shutdown)
  virsh start <VM name> (to restart the VM)
  Ensure the OpenShift nodes are all in Ready state before proceeding with the workaround steps:
  oc get node -o wide
  for i in $(oc get node --no-headers | awk '{print $1}'); do ssh $i systemctl status atomic-openshift-node; done
  This can help get all the OpenShift nodes back to the Ready state and provides a clean environment before applying the workaround.
  Workaround:
  Run the following commands from any of the control nodes, for example, e1n1-1-control:
  - Change the log level in the master.env file on all masters:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/' /etc/origin/master/master.env"; done
  - Restart the controller on the master nodes:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "master-restart api && master-restart controllers"; done
  - Wait for the service to be up and running:
    - Run the following command from any of the control nodes:
      for i in $(oc get node --no-headers | grep master | awk '{print $1}'); do oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done
    - Make sure the output looks similar to the following example, where each pod is in Running state with 1/1 in the third column. You can run the command multiple times, or wait a couple of minutes, until all the respective pods are up and running. Example:
      [root@prunella1 ~]# for i in $(oc get node | grep master | awk '{print $1}'); do echo $i && oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done
      prunella1.fyre.ibm.com
      kube-system master-api-prunella1.fyre.ibm.com 1/1 Running 1 6d
      kube-system master-controllers-prunella1.fyre.ibm.com 1/1 Running 1 6d
      prunella2.fyre.ibm.com
      kube-system master-api-prunella2.fyre.ibm.com 1/1 Running 2 6d
      kube-system master-controllers-prunella2.fyre.ibm.com 1/1 Running 1 6d
      prunella3.fyre.ibm.com
      kube-system master-api-prunella3.fyre.ibm.com 1/1 Running 1 6d
      kube-system master-controllers-prunella3.fyre.ibm.com 1/1 Running 1 6d
  - Change the atomic-openshift-node log level:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/' /etc/sysconfig/atomic-openshift-node"; done
  - Restart the node services:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "systemctl restart atomic-openshift-node.service"; done
  - Make sure the node service is active on all nodes:
    for i in $(oc get node --no-headers | awk '{print $1}'); do ssh $i "systemctl status atomic-openshift-node.service | grep 'Active:'"; done
    Ensure that all Active: statuses are active (running) as in the sample output:
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:26 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
    FIPS mode initialized
    Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
- CSR pending on the worker nodes
- The CSRs (Certificate Signing Requests) are generated to extend the node certificate validity after the default one-year validity passes. You can approve them manually with cluster administrator privileges, and enable the auto-approver to approve them automatically in the future. If CSRs are in pending status, the system might show the following symptoms:
  - Worker nodes in NotReady state, and pods may show NodeLost:
    [root@e2n1-1-worker ~]# oc get node
    NAME STATUS ROLES AGE VERSION
    e1n1-1-control.fbond Ready compute,infra,master 1y v1.11.0+d4cacc0
    e1n2-1-control.fbond Ready compute,infra,master 1y v1.11.0+d4cacc0
    e1n3-1-control.fbond Ready compute,infra,master 1y v1.11.0+d4cacc0
    e1n4-1-worker.fbond NotReady compute 1y v1.11.0+d4cacc0
    e2n1-1-worker.fbond NotReady compute 1y v1.11.0+d4cacc0
    e2n2-1-worker.fbond NotReady compute 1y v1.11.0+d4cacc0
  - atomic-openshift-node service failing to start on the worker nodes
  - Error messages in the atomic-openshift-node log on the problem node:
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: E0119 22:52:49.180917 19159 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2021-01-15 00:28:00 +0000 UTC
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: I0119 22:52:49.180947 19159 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
  - CSRs showing Pending condition when checking the status:
    [root@e1n1-1-control ~]# oc get csr
    NAME AGE REQUESTOR CONDITION
    csr-24sgr 17h system:admin Pending
    csr-25bvd 10h system:node:e1n3-1-control.fbond Pending
    csr-26c4l 4h system:node:e1n2-1-control.fbond Pending
    csr-2757q 4h system:node:e1n2-1-control.fbond Pending
    csr-27l4c 4h system:admin Pending
    csr-2j4h8 16h system:node:e1n2-1-control.fbond Pending
    csr-2knq7 5h system:node:e1n2-1-control.fbond Pending
    csr-2ltdq 16h system:admin Pending
    csr-2mvdd 5h system:node:e1n3-1-control.fbond Pending
    csr-2nf2k 8h system:node:e1n2-1-control.fbond Pending
    csr-2rtj7 6h system:admin Pending
  Workaround:
  - Run the certificate approval manually:
    oc get csr -o name | xargs oc adm certificate approve
    In oc get node, the nodes can recover to the Ready state after a few minutes.
  - Deploy the auto-approver by running:
    ansible-playbook -i /opt/ibm/appmgt/config/hosts /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml -vvvv -e openshift_master_bootstrap_auto_approve=true
  - Add openshift_master_bootstrap_auto_approve=true to /opt/ibm/appmgt/config/hosts:
    ...
    [OSEv3:vars]
    openshift_master_bootstrap_auto_approve=true
    ...
  - Run the annual certificate expiry check. For more information, see Running Red Hat OpenShift certificate expiry check.
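  After approving the CSRs, recovery can be verified with the same oc commands used in this section; pending requests should drain and the workers should return to Ready within a few minutes:
    oc get csr | grep -c Pending   # should eventually report 0
    oc get node                    # worker nodes should show Ready again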