IBM Support

PH55922: AFTER UPGRADING AN ACCELERATOR MULTI-NODE INSTALLATION TO MAINTENANCE LEVEL 7.5.11 (7.5.10) THE DATA NODES FAIL TO START


APAR status

  • Closed as documentation error.

Error description

  • In the middle of upgrading an accelerator multi-node
    installation to maintenance level 7.5.11 (7.5.10), the upgrade
    appears to be stuck: the "updating" step takes almost one hour
    without making progress.
    Customers who have deployed Db2 Analytics Accelerator on
    zSystems as a multi-node installation are affected if the
    following conditions apply:
    - the boot disk used is a SCSI device
    AND either of the following:
    - the hardware model is z16 or LinuxONE 4 and the firmware (MCL)
      level is at or below Bundle S30.
    - the hardware model is z15 or LinuxONE 3 and the firmware (MCL)
      level is between Bundle S71 and Bundle S83 (inclusive).
    
    Additional keywords:
    TS013392810 TS013618520 multinode installation SCSI
    
    IBM internal service information:
    If logged on to the box, look at
    
    journalctl -b | egrep '(PUT|POST|GET)'
    
    If analyzing an SSC dump, change to the
    cdump-sosreport-.../var/log/journal directory and run
    
    journalctl -D . -b | egrep '(PUT|POST|GET)' > rest_calls.txt
    
    This returns output like
    
    Jul 16 18:17:51 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.200/
    Jul 16 18:17:55 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.199/
    Jul 16 18:18:00 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.198/
    Jul 16 18:18:24 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.200/
    Jul 16 18:58:52 head uwsgi[5478]: PUT
    /api/com.ibm.aqt/configuration/validate_credentials
                                        File
    "/var/www/api/configuration.py", line 180, in PUT
                                          return
    self.request('POST', url, data=data, json=json, **kwargs)
    Jul 16 18:58:52 head selog_cli[28169]: SELOG: 2a5a00a2
    0000000000000001 ZFPC_RST LOGCLASS:25 LOGTYPE:00 LOGACTION:0
    LOGCOMPONENT:zfpc PUT
    /api/com.ibm.aqt/configuration/validate_credentials:
    requests.exceptions.ConnectionError:
    HTTPSConnectionPool(host=*10.15.36.198*, port=443): Max retries
    exceeded with url: /api/com.ibm.zaci.system/api-tokens (Caused
    by
    NewConnectionError(*<urllib3.connection.VerifiedHTTPSConnection
    object at 0x3ff8847eaf0>: Failed to establish a new connection:
    [Errno 113] No route to host*)); from
    /usr/lib/python3/dist-packages/requests/adapters.py, line 516,
    in send: raise ConnectionError(e, request=request)
    The output contains IP addresses from all data nodes, like
    10.15.36.200 in the example above.
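    To get an overview of which node IPs appear in the capture at
    all, a short grep pipeline can help. This is a sketch, not part
    of the official procedure; the two-line sample capture is only
    for illustration, and the 10.x.y.z pattern is an assumption
    based on the sample output above. In practice, point grep at the
    rest_calls.txt produced by the journalctl command.

    ```shell
    # Illustration with a two-line sample capture; in practice use the
    # rest_calls.txt produced by the journalctl command above.
    cat > rest_calls.txt <<'EOF'
    Jul 16 18:17:51 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.200/
    Jul 16 18:17:55 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.199/
    EOF

    # List the distinct node IPs that appear in the capture
    # (the IP pattern is an assumption based on the sample output).
    grep -oE '10\.[0-9]+\.[0-9]+\.[0-9]+' rest_calls.txt | sort -u
    # prints:
    # 10.15.36.199
    # 10.15.36.200
    ```

    Any expected data-node IP missing from this list never received
    a REST call from the head node at all.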
    
    You can then check the update sequence for all data nodes, like
    10.15.36.196 in this example:
    
    grep 196 rest_calls.txt
    ...
    Jul 16 17:57:52 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/
    Jul 16 17:58:25 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - POST:
    https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/api/com.ibm.zaci.system/appliance
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - POST:
    https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
    Jul 16 18:58:06 head uwsgi-core[5477]: rest_lib.py- DEBUG -
    POST: https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
    
    If this ends with successful api-tokens requests, as with .196
    here, the update worked for that LPAR. If it ends with REST-call
    errors, or if it simply hangs for more than, say, 10 minutes,
    you are probably hitting the issue described here, caused by a
    defect in the SSC installer shipped with IBM firmware.
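    Instead of grepping each node individually, the per-node check
    above can be summarized with a small awk sketch that prints the
    last logged REST-call line per node IP. The rest_calls.txt name
    comes from the commands above; the `last_calls` helper and the
    IP pattern are illustrative assumptions, not part of the
    appliance tooling.

    ```shell
    # Print the most recent REST-call line seen for each node IP in a
    # capture file. Usage: last_calls rest_calls.txt
    last_calls() {
      awk '
        match($0, /10\.[0-9]+\.[0-9]+\.[0-9]+/) {  # IP pattern assumed from sample
          ip = substr($0, RSTART, RLENGTH)
          last[ip] = $0                            # keep only the latest line per IP
        }
        END { for (ip in last) print ip ": " last[ip] }
      ' "$1"
    }
    ```

    A node whose last line is a POST to
    /api/com.ibm.zaci.system/api-tokens completed its update
    handshake; a node whose last line is a plain GET, or an error,
    is a candidate for the problem described here.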
    

Local fix

  • How to avoid the issue during the update of the accelerator:
    1) Use the accelerator Admin UI update wizard as usual in
    order to quiesce components and to download the export file.
    2) Go to the HMC:
       a) Bring all LPARs (head as well as data) into SSC installer
          mode
       b) Deactivate all LPARs (use 'stop' if HMC is running in DPM
          mode)
       c) Activate all LPARs (use 'start' if HMC is running in DPM
          mode). Activate the head node after all data nodes to make
          it easier for PR/SM to find an optimal placement.
    3) Install the new image on the head node.
    4) Using the accelerator Admin UI, finalize the update using
       the export file.
    
    How to overcome the issue when it is present during an update of
    the accelerator:
    1) Go to the HMC.
       a) Bring all data node LPARs into SSC installer mode.
       b) Deactivate all data node LPARs (use 'stop' if HMC is
          running in DPM mode).
       c) Activate all data node LPARs (use 'start' if HMC is
          running in DPM mode).
    2) In a remote maintenance session with IBM Support:
       a) Log on to the head node via ssh.
       b) Run /root/tools/restart-r2d2.sh.
       c) Enter 'idaa_service'.
          Enter 'system_state UPDATE_CLUSTER_WAIT_CREDENTIALS'.
          Enter 'system_state' to verify the successful setting.
          Enter 'exit'.
    
    At this point, the Admin UI again shows the panel asking for
    cluster credentials, and the installation attempt can be
    repeated.
    
    Note:
    If a client does not want to give IBM Support ssh access to the
    accelerator via a remote maintenance session, the client can
    instead restart the update from scratch using the description
    above (manually setting all LPARs into SSC installer mode and
    manually re-activating all LPARs).
    

Problem summary

  • Problem Summary:
    An upgrade of an Accelerator multi-node deployment appears to be
    stuck: the "updating" step takes almost one hour without making
    progress.
    
    Users Affected:
    Administrators of Accelerator on IBM Z multi-node deployments
    using zFCP / SCSI boot devices.
    
    Problem Scenario:
    Updates, and probably also initial installations, of the
    Accelerator on IBM Z are affected if zFCP / SCSI boot disks are
    used and one of the following applies:
    - the hardware model is z16 or LinuxONE 4 (IBM machine type 3931
      or 3932) and the firmware (MCL) level is at or below
      Bundle S30.
    - the hardware model is z15 or LinuxONE 3 (IBM machine types
      8561 or 8562) and the firmware (MCL) level is between Bundle
      S71 and Bundle S83 (inclusive).
    
    Note:
    Accelerator deployments using an ECKD boot disk are not
    affected.
    
    Problem Symptoms:
    See Problem Summary.
    

Problem conclusion

  • Upgrade the firmware of your IBM Z system:
    - for the z15 environment (IBM machine types 8561 and 8562),
      the fix is MCL P46655.015, released in D41C Bundle S84 (on
      Dec 19, 2023).
    - for the z16 environment (IBM machine type 3931), the fix will
      be part of the D51C Bundle S31 (ETA: end of April 2024).
    

Temporary fix

Comments

APAR Information

  • APAR number

    PH55922

  • Reported component name

    ANYTCS ACCLTR Z

  • Reported component ID

    5697DA700

  • Reported release

    750

  • Status

    CLOSED DOC

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2023-07-21

  • Closed date

    2024-03-18

  • Last modified date

    2024-03-18

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

