IBM Support

PH55922: AFTER UPGRADING AN ACCELERATOR MULTI-NODE INSTALLATION TO MAINTENANCE LEVEL 7.5.11 (7.5.10) THE DATA NODES FAIL TO START


APAR status

  • Closed as documentation error.

Error description

  • In the middle of upgrading an accelerator multi-node
    installation to maintenance level 7.5.11 (7.5.10), the upgrade
    appears to be stuck: the "updating" step takes almost one hour
    without making progress.
    Customers who have deployed Db2 Analytics Accelerator on
    zSystems as a multi-node installation are affected if the
    following conditions apply:
    - the boot disk used is a SCSI device
    AND either of the following:
    - the hardware model is z16 or LinuxONE 4 and the firmware (MCL)
      level is at or below Bundle S30.
    - the hardware model is z15 or LinuxONE 3 and the firmware (MCL)
      level is between Bundle S71 and Bundle S83 (inclusive).
    
    Additional keywords:
    TS013392810 TS013618520 multinode installation SCSI
    
    IBM internal service information:
    If logged on to the box, look at
    
    journalctl -b | egrep '(PUT|POST|GET)'
    
    If analyzing an SSC dump, change to the
    cdump-sosreport-.../var/log/journal directory and run
    
    journalctl -D . -b | egrep '(PUT|POST|GET)' > rest_calls.txt
    
    This returns output like
    
    Jul 16 18:17:51 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.200/
    Jul 16 18:17:55 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.199/
    Jul 16 18:18:00 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.198/
    Jul 16 18:18:24 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.200/
    Jul 16 18:58:52 head uwsgi[5478]: PUT
    /api/com.ibm.aqt/configuration/validate_credentials
                                        File
    "/var/www/api/configuration.py", line 180, in PUT
                                          return
    self.request('POST', url, data=data, json=json, **kwargs)
    Jul 16 18:58:52 head selog_cli[28169]: SELOG: 2a5a00a2
    0000000000000001 ZFPC_RST LOGCLASS:25 LOGTYPE:00 LOGACTION:0
    LOGCOMPONENT:zfpc PUT
    /api/com.ibm.aqt/configuration/validate_credentials:
    requests.exceptions.ConnectionError:
    HTTPSConnectionPool(host=*10.15.36.198*, port=443): Max retries
    exceeded with url: /api/com.ibm.zaci.system/api-tokens (Caused
    by
    NewConnectionError(*<urllib3.connection.VerifiedHTTPSConnection
    object at 0x3ff8847eaf0>: Failed to establish a new connection:
    [Errno 113] No route to host*)); from
    /usr/lib/python3/dist-packages/requests/adapters.py, line 516,
    in send: raise ConnectionError(e, request=request)
    The output contains IP addresses from all data nodes, like
    10.15.36.200 in the example above.
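    To get an overview of which node IPs appear in the capture at
    all, a short grep pipeline can help. This is a sketch, not part
    of the official procedure; the two-line sample capture is only
    for illustration, and the 10.x.y.z pattern is an assumption
    based on the sample output above. In practice, point grep at the
    rest_calls.txt produced by the journalctl command.

    ```shell
    # Illustration with a two-line sample capture; in practice use the
    # rest_calls.txt produced by the journalctl command above.
    cat > rest_calls.txt <<'EOF'
    Jul 16 18:17:51 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.200/
    Jul 16 18:17:55 head python3[9132]: rest_lib.py- DEBUG - GET: https://10.15.36.199/
    EOF

    # List the distinct node IPs that appear in the capture
    # (the IP pattern is an assumption based on the sample output).
    grep -oE '10\.[0-9]+\.[0-9]+\.[0-9]+' rest_calls.txt | sort -u
    # prints:
    # 10.15.36.199
    # 10.15.36.200
    ```

    Any expected data-node IP missing from this list never received
    a REST call from the head node at all.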
    
    You can then check the update sequence for all data nodes, like
    10.15.36.196 in this example:
    
    grep 196 rest_calls.txt
    ...
    Jul 16 17:57:52 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/
    Jul 16 17:58:25 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - POST:
    https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - GET:
    https://10.15.36.196/api/com.ibm.zaci.system/appliance
    Jul 16 17:58:55 head python3[9132]: rest_lib.py- DEBUG - POST:
    https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
    Jul 16 18:58:06 head uwsgi-core[5477]: rest_lib.py- DEBUG -
    POST: https://10.15.36.196/api/com.ibm.zaci.system/api-tokens
    
    If this ends with successful api-tokens requests, as with .196
    here, the update worked for that LPAR. If it ends with REST-call
    errors, or if it simply hangs for more than, say, 10 minutes,
    you are probably hitting the issue described here, caused by a
    defect in the SSC installer shipped with IBM firmware.
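    Instead of grepping each node individually, the per-node check
    above can be summarized with a small awk sketch that prints the
    last logged REST-call line per node IP. The rest_calls.txt name
    comes from the commands above; the `last_calls` helper and the
    IP pattern are illustrative assumptions, not part of the
    appliance tooling.

    ```shell
    # Print the most recent REST-call line seen for each node IP in a
    # capture file. Usage: last_calls rest_calls.txt
    last_calls() {
      awk '
        match($0, /10\.[0-9]+\.[0-9]+\.[0-9]+/) {  # IP pattern assumed from sample
          ip = substr($0, RSTART, RLENGTH)
          last[ip] = $0                            # keep only the latest line per IP
        }
        END { for (ip in last) print ip ": " last[ip] }
      ' "$1"
    }
    ```

    A node whose last line is a POST to
    /api/com.ibm.zaci.system/api-tokens completed its update
    handshake; a node whose last line is a plain GET, or an error,
    is a candidate for the problem described here.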
    

Local fix

  • How to avoid the issue during the update of the accelerator:
    1) Use the accelerator Admin UI update wizard as usual in
    order to quiesce components and to download the export file.
    2) Go to the HMC:
       a) Bring all LPARs (head as well as data) into SSC installer
          mode
       b) Deactivate all LPARs (use 'stop' if HMC is running in DPM
          mode)
       c) Activate all LPARs (use 'start' if HMC is running in DPM
          mode). Activate the head node after all data nodes to make
          it easier for PR/SM to find an optimal placement.
    3) Install the new image on the head node.
    4) Using the accelerator Admin UI, finalize the update using
       the export file.
    
    How to overcome the issue when it is present during an update of
    the accelerator:
    1) Go to the HMC.
       a) Bring all data node LPARs into SSC installer mode.
       b) Deactivate all data node LPARs (use 'stop' if HMC is
          running in DPM mode).
       c) Activate all data node LPARs (use 'start' if HMC is
          running in DPM mode).
    2) In a remote maintenance session with IBM Support:
       a) Log on to the head node via ssh.
       b) Run /root/tools/restart-r2d2.sh.
       c) Enter 'idaa_service'.
          Enter 'system_state UPDATE_CLUSTER_WAIT_CREDENTIALS'.
          Enter 'system_state' to verify the successful setting.
          Enter 'exit'.
    
    At this point, the Admin UI again shows the panel asking for
    cluster credentials, and the installation attempt can be
    repeated.
    
    Note:
    If a client does not want to give IBM Support ssh access to the
    accelerator via a remote maintenance session, the client can
    instead restart the update from scratch using the description
    above (manually setting all LPARs into SSC installer mode and
    manually re-activating all LPARs).
    

Problem summary

  • Problem Summary:
    An upgrade of an Accelerator multi-node deployment appears to be
    stuck: the "updating" step takes almost one hour without making
    progress.
    
    Users Affected:
    Administrators of Accelerator on IBM Z multi-node deployments
    using zFCP / SCSI boot devices.
    
    Problem Scenario:
    Updates, and probably also initial installations, of the
    Accelerator on IBM Z are affected if zFCP / SCSI boot disks are
    used and one of the following applies:
    - the hardware model is z16 or LinuxONE 4 (IBM machine type 3931
      or 3932) and the firmware (MCL) level is at or below
      Bundle S30.
    - the hardware model is z15 or LinuxONE 3 (IBM machine types
      8561 or 8562) and the firmware (MCL) level is between Bundle
      S71 and Bundle S83 (inclusive).
    
    Note:
    Accelerator deployments using an ECKD boot disk are not
    affected.
    
    Problem Symptoms:
    See Problem Summary.
    

Problem conclusion

  • Upgrade the firmware of your IBM Z system:
    - for the z15 environment (IBM machine types 8561 and 8562),
      the fix is MCL P46655.015, released in D41C Bundle S84 (on
      Dec 19, 2023).
    - for the z16 environment (IBM machine type 3931), the fix will
      be part of the D51C Bundle S31 (ETA: end of April 2024).
    

Temporary fix

Comments

APAR Information

  • APAR number

    PH55922

  • Reported component name

    ANYTCS ACCLTR Z

  • Reported component ID

    5697DA700

  • Reported release

    750

  • Status

    CLOSED DOC

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2023-07-21

  • Closed date

    2024-03-18

  • Last modified date

    2024-03-18

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

