Version 1.0.7.3 release notes

Cloud Pak for Data System version 1.0.7.3 includes security and monitoring enhancements, interactive visualizations in the web console, and a number of fixes.

Upgrading

Your system must be on version 1.0.4.x, 1.0.5.x or 1.0.7.x to upgrade to version 1.0.7.3. Follow the upgrade instructions at Upgrading to version 1.0.7.3.
Note:
  1. A new preliminary check process is available with the --preliminary-check-with-fixes option. The process no longer only checks for possible issues, but also attempts to automatically fix any known issues during the pre-checks.
  2. The recommended location to stage the VM and Services bundles is GPFS storage. You must use /opt/ibm/appliance/storage/platform/localrepo/<version> instead of /localrepo/<version>. The recommended file path is provided in the upgrade procedure steps.
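  For example, assuming that the bundle version directory is named 1.0.7.3_release, as in the apupgrade commands later in these notes, you could create the staging directory on GPFS storage as follows (a sketch only; adjust the directory name to your bundle):
    mkdir -p /opt/ibm/appliance/storage/platform/localrepo/1.0.7.3_release
    # copy the downloaded VM and Services bundle files into this directory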

What's new

Stricter security settings for SSL connections
Starting with version 1.0.7.3, the modern profile is used as the default value for ROUTER_CIPHERS, which provides an extremely high level of security. If you need to loosen the security settings, follow these steps:
  1. Run the following command from e1n1-1-control:
    oc edit dc router -n default
  2. Customize ROUTER_CIPHERS as needed. For example, change the value to intermediate, which was used in previous releases. For more information, see https://wiki.mozilla.org/Security/Server_Side_TLS.
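As an alternative to editing the DeploymentConfig interactively, you can set the variable directly with oc set env. This is a sketch that assumes the default router DeploymentConfig name in the default namespace:
oc set env dc/router ROUTER_CIPHERS=intermediate -n default
With the default configuration change trigger on the DeploymentConfig, the router pods redeploy automatically after the change.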
Log forwarding feature
You can now forward the control node logs to a remote log server with the apsyslog command as described in Forwarding logs to a remote server.
Serviceability
Collecting Netezza-related logs (specifically SPU logs) with apdiag now requires the --spus option. You can use the --spus option with a space-separated list of SPU names, or use --spus all to collect logs for all SPUs in the system. For example:
apdiag collect --components ips/ --spus spu0101 spu0102
For more information on apdiag, see apdiag command reference.
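For example, to collect SPU logs for all SPUs in the system:
apdiag collect --components ips/ --spus all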
Web console improvements
The web console now provides interactive visualizations of hardware components and their actual location in the rack. You can zoom the rack view in and out, filter the view by status or by component type, and view the details of a selected component.

Software components

Fixed issues

  • Fixed the issue with portworx-api pods missing on control nodes.
  • Fixed the issue with ap sw showing the wrong web console version.
  • Fixed the OpenShift upgrade issues with route reconfiguration.
  • Fixed the issue with services upgrade failing due to cluster certificate expiry errors.
  • Fixed the user management issue with the admin user not being able to add services in the web console after the upgrade.
  • Fixed the issue with the upgrade failing due to timeout while running the appcli healthcheck --app=portworx command.
  • Fixed the issues with services upgrade failing with timeout errors.
  • Fixed the Call Home page access issues for Asia-Pacific customers.
  • Fixed the issue with the system not being accessible from a remote system by using the external IP after the network configuration steps (alert 610: Floating IP interface is up, but inactive in network manager). Platform Manager now self-corrects the issue.
  • Fixed the issue with multiple Unreachable or missing device for NSD events being opened.

Known issues

Upgrade might fail if e1n1 is not the hub in Platform Manager
Before you start the upgrade, ensure that e1n1 is the hub, as described in the upgrade procedure. Otherwise, several nodes might become inaccessible or stop responding within the expected time frame after the nodes reboot, and the upgrade fails with an error similar to the following:
2021-03-09 21:40:05 INFO: Some nodes were not available.
                          [u'e15n1', u'e16n1', u'e1n4', u'e2n1', u'e2n2', u'e2n3', u'e2n4']
2021-03-09 21:40:05 ERROR: Error running command [systemctl restart network] on [u'e2n4-fab', u'e2n3-fab', u'e15n1-fab', u'e2n1-fab', u'e16n1-fab', u'e1n4-fab', u'e2n2-fab']
2021-03-09 21:40:05 ERROR: Unable to restart network services on [u'e15n1-fab', u'e16n1-fab', u'e1n4-fab', u'e2n1-fab', u'e2n2-fab', u'e2n3-fab', u'e2n4-fab']
2021-03-09 21:40:05 ERROR: ERROR: Error running command [systemctl restart network] on [u'e2n4-fab', u'e2n3-fab', u'e15n1-fab', u'e2n1-fab', u'e16n1-fab', u'e1n4-fab', u'e2n2-fab']
2021-03-09 21:40:05 ERROR: 
2021-03-09 21:40:05 ERROR: Unable to powercycle nodes via ipmitool.
2021-03-09 21:40:05 ERROR: 'bmc_addr'
2021-03-09 21:40:05 ERROR: The following nodes are still unavailable after a reboot attempt: [u'e15n1', u'e16n1', u'e1n4', u'e2n1', u'e2n2', u'e2n3', u'e2n4']
2021-03-09 21:40:05 FATAL ERROR: Problem rebooting nodes
To recover from the error:
  1. Check for the hub node by verifying that the dhcpd service is running on the control nodes (a loop that checks all control nodes at once is shown after these steps):
    systemctl is-active dhcpd
  2. If the dhcpd service is not running, on e1n1, run:
    systemctl start dhcpd
  3. Bring up the floating IPs:
    1. On e1n2 and e1n3, run:
      /opt/ibm/appliance/platform/management/actions/master_disable.py -scope master
    2. On e1n1, run:
      /opt/ibm/appliance/platform/management/actions/master_enable.py -scope master
  4. Rerun the upgrade where it failed.
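As a convenience for step 1, the following sketch checks the dhcpd service on all three control nodes in one loop. It assumes passwordless SSH between the control nodes; adjust the node names if your system differs:
for node in e1n1 e1n2 e1n3; do echo -n "$node: "; ssh $node systemctl is-active dhcpd; done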
OpenShift certificates might expire after a year
You must run the annual certificate expiry check and ensure that the certificates are valid for the following year. For more information, see Running Red Hat OpenShift certificate expiry check.
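As a rough sketch, on OpenShift 3.11 the check typically runs the certificate_expiry playbook that ships with openshift-ansible, using the same inventory file that is used elsewhere in these notes. The playbook path is an assumption, so verify it against the linked procedure before running it:
ansible-playbook -i /opt/ibm/appmgt/config/hosts /usr/share/ansible/openshift-ansible/playbooks/openshift-checks/certificate_expiry/easy-mode.yaml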
System web console starts with the onboarding procedure after the upgrade
Starting with version 1.0.7.3, the U-position setting in the hardware configuration for enclosures and switches is mandatory. If your system does not have these fields set, the web console opens with the onboarding procedure, and you must fill in all the required fields before you can proceed to the home page.
Upgrade from 1.0.4.x with switch upgrade fails with GPFS fatal error but restarts
System bundle upgrade might fail at GPFS upgrade with a fatal error in the log:
58:34 FATAL ERROR: GpfsUpgrader.install : Fatal Problem: Could not copy files to all nodes.
However, the nodes are rebooted and the upgrade restarts and continues. GPFS is upgraded and you can ignore the failure.
Portworx volume mount fails with error signal: killed or exit status 32
If an application pod fails to start and the Events section in the pod description shows a volume mount issue due to a signal: killed error, for example:
[root@e1n1-1-control ~]#  oc describe pod <pod-name> -n <pod-namespace> | grep -A 10 ^Events:
Events:
  Type     Reason     Age                 From                          Message
  ----     ------     ----                ----                          -------
  Warning  FailedMount  49s (x8 over 3m)  kubelet, e2n1-1-worker.fbond  MountVolume.SetUp failed for volume "pvc-9731b999-f85b-11ea-be79-003658063f81" : rpc error: code = Internal desc = Failed to mount volume pvc-9731b999-f85b-11ea-be79-003658063f81: signal: killed
apply the workaround described in Troubleshooting issues with Portworx.
False Portworx upgrade failed message in the upgrade log
If the upgrade reports that the Portworx upgrade failed and you can see the following error:
Running version not as expected: Failed to confirm PX upgrade version
Portworx was upgraded successfully, but the version was not updated in time on one of the nodes. Workaround:
  1. Verify in the appmgt.log file that the Portworx upgrade was successful:
    2020-09-29 10:18:19,130 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:645)] - portworx_upgrade: all portworx pods in the cluster are up and running
    2020-09-29 10:18:19,132 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:646)] - portworx_upgrade: check_px_pods - Finish.
    2020-09-29 10:18:19,133 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:657)] - Portworx upgrade: confirm_px_upgrade_success - entering....
    2020-09-29 10:18:19,135 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.portworx_util.PortworxUtil(portworx_util.py:376)] - get_px_running_versions: start....
    2020-09-29 10:18:19,137 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:189)] - Executing: /bin/bash -c pxctl cluster list |sed '1,6d' | sed 's/[[:space:]]\+/#/g' |cut -d '#' -f 10
    2020-09-29 10:18:19,139 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:198)] - with input = None, env = None, dir = None
    2020-09-29 10:18:19,304 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:234)] - returncode = 0, raiseOnExitStatus = True
    2020-09-29 10:18:19,304 [DEBUG] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:243)] -
    2020-09-29 10:18:19,307 [DEBUG] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.command(command.py:244)] - 2.5.0.1-ff37efc
    2.3.1.1-3fbe1bf
    2.5.0.1-ff37efc
    2.5.0.1-ff37efc
    2.5.0.1-ff37efc
    2.5.0.1-ff37efc
    
    
    2020-09-29 10:18:19,309 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.utils.portworx_util.PortworxUtil(portworx_util.py:379)] - get_px_running_versions: ['2.5.0.1-ff37efc', '2.3.1.1-3fbe1bf', '2.5.0.1-ff37efc', '2.5.0.1-ff37efc', '2.5.0.1-ff37efc', '2.5.0.1-ff37efc']
    2020-09-29 10:18:19,310 [INFO] <appmgnt.rest.server.restServer:bound_func:110058> [appmgnt.tools.portworx.portworx(portworx.py:661)] - Portworx has successfully upgraded to: 2.5.0.1-ff37efc
  2. Run the following commands to verify that Portworx is upgraded and running the new version (see also the example after these steps):
    pxctl status
    oc get pods -n kube-system
  3. Resume the upgrade process.
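For step 2, one way to list only the Portworx pods and the nodes they run on (assuming the standard name=portworx label on the Portworx pods) is:
oc get pods -n kube-system -l name=portworx -o wide
All listed pods should be Running before you resume the upgrade.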
Switch configuration automation breaks the Netezza Performance Server network setup
Version 1.0.7.0 introduces automation for application network switch settings. However, if Netezza Performance Server is installed on your system, and you decide to modify your application network setup, do not run the switch automation step, as it will break the Netezza instance. You must configure the switches manually as described in Deprecated: Manually configuring fabric network switches.
Upgrade breaks house network configuration for systems with NPS
On NPS systems with a 10G house network, management network operations (DNS/LDAP/time/call home) are broken after the upgrade. If you have Netezza installed, follow the post-upgrade steps described in Netezza Performance Server post-upgrade steps to fix the network setup.
Upgrade from versions earlier than 1.0.7.x with NPS installed fails
When NPS is installed on a system running a version earlier than 1.0.7.x, apupgrade might fail due to an unknown MAC address for the fabric switch.

Workaround:

  1. Rerun apupgrade with --skip-firmware and --skip-hw-cfg options:
    apupgrade --upgrade --use-version 1.0.7.3_release --upgrade-directory /localrepo --bundle system --skip-firmware --skip-hw-cfg
  2. Continue with the upgrade.
  3. After the upgrade finishes successfully, rerun the system bundle upgrade to update the firmware on the system:
    apupgrade --upgrade --use-version 1.0.7.3_release --upgrade-directory /localrepo --bundle system
Cloud Pak for Data must be installed in the Zen namespace
Cloud Pak for Data must always be installed in the Zen namespace. Otherwise, the upgrade of this component fails, and the VM must be redeployed, which means that all data is lost.
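To confirm where Cloud Pak for Data is running, you can list its pods in the zen namespace, for example:
oc get pods -n zen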
Upgrade precheck fails with IP conflict error
If the upgrade precheck fails with an error similar to the following:
ERROR     : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (40:F2:E9:2F:99:57) already uses address 10.8.9.250.
verify whether the ASA/ARP settings are enabled. The Cisco device configuration must be fixed so that the device does not respond to ARP requests that are not designated for it.
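To identify the device that is answering for the conflicting address, you can send ARP requests from a control node. This is a sketch; the interface name is a placeholder and the address is taken from the sample error, so adjust both to your environment:
arping -c 3 -I <house-network-interface> 10.8.9.250
The MAC address in the replies identifies the device that must be reconfigured.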
apupgrade precheck failing because of missing files

In some cases, the yosemite-kube-1.0.7.3-SNAPSHOT-x86_64.tar.gz file might be missing from the /opt/ibm/appliance/storage/platform/upgrade/icp4d_console/ and /install/app_img/icp4d_console locations.

Workaround:

Copy the missing file from the existing upgrade bundle to the following locations (an example command sequence is shown after this list):
  • /opt/ibm/appliance/storage/platform/upgrade/icp4d_console/
  • /install/app_img/icp4d_console
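For example, locate the file in the staged upgrade bundle and copy it to both locations. This is a sketch that assumes the bundle is staged under /localrepo; adjust the paths to your environment:
find /localrepo -name 'yosemite-kube-1.0.7.3-SNAPSHOT-x86_64.tar.gz'
cp <path-returned-by-find> /opt/ibm/appliance/storage/platform/upgrade/icp4d_console/
cp <path-returned-by-find> /install/app_img/icp4d_console/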
Service bundle upgrade failing due to a read timeout
If an error similar to the following is logged during the upgrade:
2020-10-02 09:24:44 TRACE: RC: 255.
                           STDOUT: [{"error": {"description": "Unexpected error happen, failed to get response: HTTPSConnectionPool(host='127.0.0.1', port=9443): Read timed out. (read timeout=600)", "errorName": "HttpsRequestError", "statusCode": 500}, "result": "Failure", "statusCode": 500}
                           ]
                           STDERR: [
                           ]
The system should come online after some time. Ignore the error and continue with the upgrade.
The Save button for the Call Home configuration in the web console is inactive
You must first enable either Problem Reporting or Status Reporting on the Control tab of the Call Home settings before you can save any further configuration settings for Call Home and Notifications (Alert Management).
appmgnt-rest service is stopped during upgrade
If an error message similar to the following appears in the apupgrade or appmgnt logs during the upgrade:
appmgnt.utils.util: ERROR    failed to check the status of appmgnt-rest
{"error": {"description": "please make sure appmgnt-rest.service is running and try again.", "errorName": "RestServiceNotRunningError", "statusCode": 500}, "result": "Failure", "statusCode": 500}
Run the following command from e1n1 to verify that platform manager has deactivated the service:
curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET
[root@e1n1 ~]# curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET
{"status": 0, "upgrade_in_progress": "False"} 
Workaround:
  1. Enable the platform manager upgrade mode on e1n1:
    curl -k https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X PUT -d '{"upgrade_in_progress": "True"}'
    [root@e1n1 ~]# curl -k https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X PUT -d '{"upgrade_in_progress": "True"}'
    {"status": 0, "message": "Application upgrade is enabled"}
  2. Log in to e1n1-1-control and restart the appmgnt-rest service:
    systemctl restart appmgnt-rest
    systemctl status appmgnt-rest -l
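To confirm that the workaround took effect, you can query the progress endpoint again with the same command as before. The upgrade_in_progress field should now report True, as in the following illustrative output:
curl -sk https://localhost:5001/apupgrade/progress -u a:$(cat /run/magneto.token) -X GET
{"status": 0, "upgrade_in_progress": "True"}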
systemd timeout issues
The default log level for OpenShift is 2, which sometimes leads to timeouts in systemd. The symptom of this issue is alerts raised by Platform Manager, for example:
2171 | 2020-10-24 19:34:50 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node2         |    MAJOR |          N/A |
2172 | 2020-10-25 01:58:27 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | vm_systemd_services@hw://enclosure2.node1         |    MAJOR |          N/A |
In addition, the system might show the following symptoms:
  • OpenShift nodes are not reachable over SSH
  • systemctl commands on VM nodes do not respond
  • OpenShift nodes are not ready for a long time and the system response is slow
If any of these symptoms occur, reboot the problematic VMs from the bare metal node by running:
  1. virsh list --all to get the VM name;
  2. virsh destroy <VM name> to force stop the non-responding VM;
  3. virsh list --all to verify that the VM shows Shutdown;
  4. virsh start <VM name> to restart the VM.
Ensure that the OpenShift nodes are all in the Ready state before proceeding with the workaround steps:
oc get node -o wide
for i in $(oc get node  --no-headers | awk '{print $1}'); do ssh $i systemctl status atomic-openshift-node; done

This helps get all the OpenShift nodes back to the Ready state and provides a clean environment before you apply the workaround.

Workaround:

Run the following commands from any of the control nodes, for example, e1n1-1-control:
  1. Change the log level in master.env file on all masters:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/'  /etc/origin/master/master.env"; done
  2. Restart the controller on master nodes:
    $ for i in $(oc get node | grep master | awk '{print $1}'); do ssh $i "master-restart api && master-restart controllers"; done
  3. Wait for the service to be up and running:
    1. Run the following command from any of the control nodes:
      for i in $(oc get node  --no-headers | grep master | awk '{print $1}'); do oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done
    2. Make sure that the output of the command looks similar to the following example, where all pods are in the Running state with 1/1 in the third (READY) column of each output line. You can run the command multiple times, or wait a couple of minutes for all of the respective pods to come up. Example:
      [root@prunella1 ~]# for i in $(oc get node | grep master | awk '{print $1}'); do echo $i && oc get pod --all-namespaces | grep -E "master-api-$i|master-controllers-$i"; done 
      prunella1.fyre.ibm.com
      kube-system                         master-api-prunella1.fyre.ibm.com                          1/1       Running     1          6d
      kube-system                         master-controllers-prunella1.fyre.ibm.com                  1/1       Running     1          6d
      prunella2.fyre.ibm.com
      kube-system                         master-api-prunella2.fyre.ibm.com                          1/1       Running     2          6d
      kube-system                         master-controllers-prunella2.fyre.ibm.com                  1/1       Running     1          6d
      prunella3.fyre.ibm.com
      kube-system                         master-api-prunella3.fyre.ibm.com                          1/1       Running     1          6d
      kube-system                         master-controllers-prunella3.fyre.ibm.com                  1/1       Running     1          6d
      
  4. Change the atomic-openshift-node log level:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "sed -i 's/DEBUG_LOGLEVEL=2/DEBUG_LOGLEVEL=0/'  /etc/sysconfig/atomic-openshift-node"; done
  5. Restart the node services:
    $ for i in $(oc get node | awk '{print $1}'); do ssh $i "systemctl restart atomic-openshift-node.service"; done
  6. Make sure node service is active on all nodes:
    for i in $(oc get node --no-headers | awk '{print $1}'); do ssh $i "systemctl status atomic-openshift-node.service | grep 'Active:'"; done
    Ensure that all Active: statuses are active (running) as in the sample output:
    FIPS mode initialized
       Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
       Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
       Active: active (running) since Thu 2020-10-29 12:38:05 PDT; 6 days ago
    FIPS mode initialized
       Active: active (running) since Thu 2020-10-29 12:38:26 PDT; 6 days ago
    FIPS mode initialized
       Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
    FIPS mode initialized
       Active: active (running) since Thu 2020-10-29 12:38:28 PDT; 6 days ago
Services upgrade summary reports blank values for version numbers
When the services upgrade completes successfully and reports the final results of the upgrade, it shows empty brackets instead of version numbers, as in the following sample:
Upgrade results

services : successfully upgraded from 
{} to {}
services Upgrade time : 3:16:47
All upgrade steps complete.

Broadcast message from root@e1n1 (Thu Sep 24 21:10:40 2020):

A running upgrade completed successfully.

Broadcast message from root@e1n1 (Thu Sep 24 21:10:40 2020):

See details at /var/log/appliance/apupgrade/20200924/apupgrade20200924173509.log for further information
This is a known issue.
Upgrade to 1.0.7.3 leaves a db2wh pod in unhealthy state
If, after the upgrade, one of the db2wh pods is in an unhealthy state, as in the following example:
[root@e1n1-1-control ~]# oc get pods --all-namespaces=True | egrep -vi "Running|Complete"
NAMESPACE                           NAME                                                              READY     STATUS      RESTARTS   AGE
zen                                 db2wh-1600568372108-db2u-ldap-b9f64d4f4-2n2mm                     0/1       Init:0/1    0          7h
Run the following steps to fix it:
  1. Run the following command to check the status of the pods:
    oc get pods --all-namespaces=True | egrep -vi "Running|Complete"
    Give the pod some time to come up after the upgrade. It might take a while, but it should not take hours.
  2. Run the following command:
    oc describe pod <pod-name> -n zen
    For example:
    oc describe pod db2wh-1600568372108-db2u-ldap-b9f64d4f4-2n2mm -n zen
    Output similar to the following is shown:
    Warning  Unhealthy  3m               kubelet, e2n3-1-db2wh.fbond  Liveness probe failed:
      Warning  Unhealthy  2m (x2 over 3m)  kubelet, e2n3-1-db2wh.fbond  Readiness probe failed:
    
  3. Delete the pod to recover:
    oc delete pod db2wh-1600568372108-db2u-ldap-b9f64d4f4-2n2mm  -n zen
CSR pending on the worker nodes
The CSRs (Certificate Signing Requests) are generated to extend the node certificate validity after the default one-year validity passes. You can approve them manually if you have cluster administrator privileges, and you can enable the auto-approver to approve them automatically in the future. If a CSR is in the Pending status, the system might show the following symptoms:
  • Worker nodes are in the NotReady state and pods might show NodeLost:
    [root@e2n1-1-worker ~]# oc get node
    NAME                   STATUS                        ROLES                  AGE       VERSION
    e1n1-1-control.fbond   Ready                         compute,infra,master   1y        v1.11.0+d4cacc0
    e1n2-1-control.fbond   Ready                         compute,infra,master   1y        v1.11.0+d4cacc0
    e1n3-1-control.fbond   Ready                         compute,infra,master   1y        v1.11.0+d4cacc0
    e1n4-1-worker.fbond    NotReady                      compute                1y        v1.11.0+d4cacc0
    e2n1-1-worker.fbond    NotReady                      compute                1y        v1.11.0+d4cacc0
    e2n2-1-worker.fbond    NotReady                      compute                1y        v1.11.0+d4cacc0
  • The atomic-openshift-node service fails to start on the worker nodes
  • Error message in the atomic-openshift-node log from the problem node:
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: E0119 22:52:49.180917   19159 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2021-01-15 00:28:00 +0000 UTC
    Jan 19 22:52:49 e2n1-1-worker.fbond atomic-openshift-node[19159]: I0119 22:52:49.180947   19159 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
  • CSRs show the Pending condition when you check the status:
    [root@e1n1-1-control ~]# oc get csr
    NAME                                                   AGE       REQUESTOR                                                 CONDITION
    csr-24sgr                                              17h       system:admin                                              Pending
    csr-25bvd                                              10h       system:node:e1n3-1-control.fbond                          Pending
    csr-26c4l                                              4h        system:node:e1n2-1-control.fbond                          Pending
    csr-2757q                                              4h        system:node:e1n2-1-control.fbond                          Pending
    csr-27l4c                                              4h        system:admin                                              Pending
    csr-2j4h8                                              16h       system:node:e1n2-1-control.fbond                          Pending
    csr-2knq7                                              5h        system:node:e1n2-1-control.fbond                          Pending
    csr-2ltdq                                              16h       system:admin                                              Pending
    csr-2mvdd                                              5h        system:node:e1n3-1-control.fbond                          Pending
    csr-2nf2k                                              8h        system:node:e1n2-1-control.fbond                          Pending
    csr-2rtj7                                              6h        system:admin                                              Pending
Workaround:
  1. Run the certificate approval manually:
    oc get csr -o name | xargs oc adm certificate approve

    In the oc get node output, the nodes should recover to the Ready state after a few minutes. A check for remaining pending requests is shown after these steps.

  2. Deploy the auto-approver by running:
    ansible-playbook -i /opt/ibm/appmgt/config/hosts /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml -vvvv -e openshift_master_bootstrap_auto_approve=true
  3. Add openshift_master_bootstrap_auto_approve=true to /opt/ibm/appmgt/config/hosts:
    ...
    
    [OSEv3:vars]
    openshift_master_bootstrap_auto_approve=true
    ...
  4. Run the annual certificate expiry check. For more information, see Running Red Hat OpenShift certificate expiry check.
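After the approval in step 1, you can verify that no certificate signing requests remain in the Pending condition, for example:
oc get csr | grep Pending
If the command returns no output, no requests are pending.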