APAR status
Closed as documentation error.
Error description
While upgrading an accelerator multi-node installation to maintenance level 7.5.11 (7.5.10), the upgrade appears to be stuck: the "updating" step takes almost one hour and makes no progress.

Customers who have deployed Db2 Analytics Accelerator on zSystems as a multinode installation are affected if the following conditions apply:
- the boot disk used is a SCSI device
AND either of the following:
- the hardware model is z16 or LinuxONE 4 and the firmware (MCL) level is less than or equal to Bundle S30.
- the hardware model is z15 or LinuxONE 3 and the firmware (MCL) level is between Bundle S71 and Bundle S83 (inclusive).

Additional keywords: TS013392810 TS013618520 multinode installation SCSI

IBM internal service information:

If logged on to the box, look at
  journalctl -b | egrep '(PUT|POST|GET)'

If analyzing an SSC dump, go to the cdump-sosreport-.../var/log/journal directory and then run
  journalctl -D . -b | egrep '(PUT|POST|GET)' > rest_calls.txt

This will return information like
  Jul 16 18:17:51 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.200/
  Jul 16 18:17:55 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.199/
  Jul 16 18:18:00 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.198/
  Jul 16 18:18:24 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.200/
  Jul 16 18:58:52 head uwsgi[5478]: PUT /api/com.ibm.aqt/configuration/validate_credentials File "/var/www/api/configuration.py", line 180, in PUT return self.request('POST', url, data=data, json=json, **kwargs)
  Jul 16 18:58:52 head selog_cli[28169]: SELOG: 2a5a00a2 0000000000000001 ZFPC_RST LOGCLASS:25 LOGTYPE:00 LOGACTION:0 LOGCOMPONENT:zfpc PUT /api/com.ibm.aqt/configuration/validate_credentials: requests.exceptions.ConnectionError: HTTPSConnectionPool(host=*10.15.36.198*, port=443): Max retries exceeded with url: /api/com.ibm.zaci.system/api-tokens (Caused by NewConnectionError(*<urllib3.connection.VerifiedHTTPSConnection object at 0x3ff8847eaf0>: Failed to establish a new connection: [Errno 113] No route to host*)); from /usr/lib/python3/dist-packages/requests/adapters.py, line 516, in send: raise ConnectionError(e, request=request)
with IP addresses from all data nodes, like 10.15.36.200 in the example above.

You can then check the update sequence for each data node, like 10.15.36.196 in this example:
  grep 196 rest_calls.txt
  ...
  Jul 16 17:57:52 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.196/
  Jul 16 17:58:25 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.196/
  Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.196/
  Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - POST: https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
  Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.196/api/com.ibm.zaci.system/appliance
  Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - POST: https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
  Jul 16 18:58:06 head uwsgi-core[5477]: rest_lib.py- DEBUG - POST: https://10.15.36.196/api/com.ibm.zaci.system/api-tokens

If the sequence ends with successful api-tokens requests, as with .196 here, the update worked for that LPAR. If it ends with REST call related errors, or if it just hangs for more than, say, 10 minutes, this is likely the issue described here, caused by a broken SSC installer in the IBM firmware.
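The per-node check of rest_calls.txt can also be scripted. The following is a minimal sketch, not an IBM tool: the function names are hypothetical, and the regular expression assumes the journal line layout shown in the sample output in this APAR. It groups the logged rest_lib.py calls by data node IP and reports whether the last logged call for a node was a POST to the api-tokens endpoint. The log lines do not record whether a request succeeded, so this is only a first hint; errors such as the ConnectionError shown in this APAR still have to be checked by hand.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the sample journal output in this APAR:
#   Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - POST: https://10.15.36.196/api/...
REST_CALL = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s.*?"
    r"(?P<method>GET|POST|PUT):\s+https://(?P<ip>\d+\.\d+\.\d+\.\d+)(?P<path>/\S*)"
)


def group_calls_by_node(lines):
    """Return {ip: [(timestamp, method, path), ...]} in log order."""
    calls = defaultdict(list)
    for line in lines:
        match = REST_CALL.search(line)
        if match:  # lines without a "METHOD: https://<ip>/..." part are skipped
            calls[match.group("ip")].append(
                (match.group("ts"), match.group("method"), match.group("path"))
            )
    return dict(calls)


def last_call_is_token_request(calls):
    """Heuristic: True if the last logged call for a node is a POST to api-tokens."""
    if not calls:
        return False
    _, method, path = calls[-1]
    return method == "POST" and path.endswith("/api-tokens")


if __name__ == "__main__":
    # rest_calls.txt is the file written by the journalctl command in this APAR.
    with open("rest_calls.txt", encoding="utf-8", errors="replace") as handle:
        by_node = group_calls_by_node(handle)
    for ip, calls in sorted(by_node.items()):
        state = "ends with api-tokens POST" if last_call_is_token_request(calls) else "CHECK"
        print(f"{ip}: {len(calls)} calls, last at {calls[-1][0]}, {state}")
```

Nodes flagged CHECK are the ones to inspect manually with grep, as described in this section.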
Local fix
How to avoid the issue during the update of the accelerator:
1) Use the accelerator Admin UI update wizard as usual to quiesce components and to download the export file.
2) Go to the HMC:
   a) Bring all LPARs (head as well as data) into SSC installer mode.
   b) Deactivate all LPARs (use 'stop' if the HMC is running in DPM mode).
   c) Activate all LPARs (use 'start' if the HMC is running in DPM mode). Activate the head node after activating all data nodes to make it easier for PR/SM to find optimal placement.
3) Install the new image on the head node.
4) In the accelerator Admin UI, finalize the update using the export file.

How to overcome the issue when it is present during an update of the accelerator:
1) Go to the HMC:
   a) Bring all data node LPARs into SSC installer mode.
   b) Deactivate all data node LPARs (use 'stop' if the HMC is running in DPM mode).
   c) Activate all data node LPARs (use 'start' if the HMC is running in DPM mode).
2) In a remote maintenance session with IBM support:
   a) Log on to the head node via ssh.
   b) Run /root/tools/restart-r2d2.sh
   c) Enter 'idaa_service'. Enter 'system_state UPDATE_CLUSTER_WAIT_CREDENTIALS'. Enter 'system_state' to verify that the state was set successfully. Enter 'exit'.

At this point, the Admin UI shows the panel asking for cluster credentials again, and the installation attempt can be repeated.

Note: if a client does not want to give IBM Support ssh access to the accelerator via a remote maintenance session, the client can restart the update from scratch using the avoidance procedure above (manually setting all LPARs into SSC installer mode and manually re-activating all LPARs).
Problem summary
Problem Summary: An upgrade of an Accelerator multi-node deployment appears to be stuck: the "updating" step takes almost one hour and makes no progress.

Users Affected: Administrators of Accelerator on IBM Z multi-node deployments using zFCP / SCSI boot devices.

Problem Scenario: Update and probably also initial installation of the Accelerator on IBM Z are affected if zFCP / SCSI boot disks are used and either of the following applies:
- the hardware model is z16 or LinuxONE 4 (IBM machine type 3931 or 3932) and the firmware (MCL) level is less than or equal to Bundle S30.
- the hardware model is z15 or LinuxONE 3 (IBM machine types 8561 or 8562) and the firmware (MCL) level is between Bundle S71 and Bundle S83.
Note: Accelerator deployments using ECKD boot disks are not affected.

Problem Symptoms: See Problem Summary.
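The affected-configuration conditions can be expressed as a small check. This is a minimal sketch under stated assumptions: the function and table names are hypothetical, bundle levels are assumed to be written as "S" followed by a number (for example "S30" or "Bundle S71"), and the deployment is assumed to be multi-node. The machine types and bundle ranges are the ones listed in this APAR, with the S71 to S83 range treated as inclusive.

```python
import re

# Affected firmware bundle ranges per IBM machine type, as listed in this APAR.
# Each value is (lowest affected bundle, highest affected bundle);
# None means the APAR states no lower bound.
AFFECTED_BUNDLES = {
    "3931": (None, 30),  # z16 / LinuxONE 4
    "3932": (None, 30),  # z16 / LinuxONE 4
    "8561": (71, 83),    # z15 / LinuxONE 3
    "8562": (71, 83),    # z15 / LinuxONE 3
}


def bundle_number(bundle):
    """Extract the numeric part of a bundle name such as 'S30' or 'Bundle S71'."""
    match = re.fullmatch(r"\s*(?:Bundle\s+)?S(\d+)\s*", bundle, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized bundle level: {bundle!r}")
    return int(match.group(1))


def is_affected(machine_type, bundle, boot_disk):
    """Return True if a multi-node deployment matches the conditions in this APAR."""
    if boot_disk.upper() != "SCSI":
        return False  # ECKD boot disks are not affected
    if machine_type not in AFFECTED_BUNDLES:
        return False  # machine type not named in this APAR
    low, high = AFFECTED_BUNDLES[machine_type]
    level = bundle_number(bundle)
    return (low is None or level >= low) and level <= high


if __name__ == "__main__":
    print(is_affected("8561", "Bundle S83", "SCSI"))
    print(is_affected("3931", "S31", "SCSI"))
```

Keeping the ranges in one table makes it easy to compare a system's current bundle level against the fix levels named in the problem conclusion.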
Problem conclusion
Upgrade the firmware of your IBM Z system:
- for the z15 environment (IBM machine types 8561 and 8562), the fix is MCL P46655.015, released in D41C Bundle S84 (on Dec 19, 2023).
- for the z16 environment (IBM machine type 3931), the fix will be part of the D51C Bundle S31 (ETA: end of April 2024).
Temporary fix
Comments
APAR Information
APAR number
PH55922
Reported component name
ANYTCS ACCLTR Z
Reported component ID
5697DA700
Reported release
750
Status
CLOSED DOC
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2023-07-21
Closed date
2024-03-18
Last modified date
2024-03-18
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Applicable component levels
[{"Business Unit":{"code":"BU011","label":"Systems - zSystems software"},"Product":{"code":"SG19M"},"Platform":[{"code":"PF054","label":"z Systems"}],"Version":"750"}]
Document Information
Modified date:
18 March 2024