IBM Support

PH59572: After Power On/Reset of an IBM z15 HW using DPM mode, a multi-node accelerator cluster using RoCE cards does not start

APAR status

  • Closed as documentation error.

Error description

  • After Power On/Reset (POR) of an IBM z15 HW using DPM mode, a
    multi-node accelerator cluster using RoCE cards does not start.
    
    The system details showed the RoCE cards in status "Stopped"
    (issue 1).
    After manual activation, communication between the head node and
    the data nodes of the multi-node cluster still did not work.
    A second look at the system details revealed that the FIDs
    assigned to the RoCE cards were no longer the ones that had been
    valid before the POR (issue 2).
    After the json file was updated with the new FIDs, communication
    between the head node and the data nodes worked again and the
    accelerator was operational.
    
    Background information:
    On a z16, both issue 1 and issue 2 do NOT occur.
    
    On a z15 (and on a z14):
    
    - Issue 1 happens due to an architectural limitation of the
    hardware / firmware.
    - Issue 2 happens due to a bug in the firmware of the z15 (z14)
    machine.
    Resolving issue 2 on an IBM z15 HW system requires installation
    of an MCL patch (P46598.557) that is included in bundle S92.
    
    Additional keywords:
    TS015305644 POR Z15 DPM ROCE FID BUNDLE S92 DT269759 DRIVER41C
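
The json file update described for issue 2 could be scripted along the following lines. This is a minimal sketch, not the accelerator's actual tooling: the APAR does not show the layout of the json file, so the "nodes" list, the "fid" key, and the FID values below are hypothetical placeholders, and remap_fids is an illustrative helper name.

```python
import json


def remap_fids(config, fid_map):
    """Return a copy of config in which every "fid" value is replaced
    according to fid_map (old FID -> new FID).

    The "fid" key name is an assumption made for this sketch; adjust it
    to whatever key the real json file uses.
    """
    if isinstance(config, dict):
        return {
            key: fid_map.get(value, value)
            if key == "fid" and isinstance(value, str)
            else remap_fids(value, fid_map)
            for key, value in config.items()
        }
    if isinstance(config, list):
        return [remap_fids(item, fid_map) for item in config]
    return config


# Hypothetical configuration and FID values, for illustration only.
old_config = json.loads(
    '{"nodes": [{"name": "head", "fid": "0x0010"},'
    ' {"name": "data1", "fid": "0x0011"}]}'
)
new_config = remap_fids(old_config, {"0x0010": "0x0020", "0x0011": "0x0021"})
print(json.dumps(new_config, indent=2))
```

The function builds a new structure instead of editing in place, so the pre-POR configuration stays available for comparison.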
    

Local fix

Problem summary

  • Problem Summary:
    After Power On/Reset of an IBM z15 machine (IBM model types
    8561, 8562) using DPM mode, a multi-node accelerator cluster
    using RoCE cards does not start.
    
    Users Affected:
    Customers to whom all of the following apply:
    - they run a multi-node accelerator
    - they use RoCE cards for the internode communication
    - they have deployed the accelerator on IBM z15 hardware (IBM
    model types 8561, 8562)
    - they use IBM Z Dynamic Partition Manager (DPM).
    
    Problem Scenario:
    See APAR Error Description.
    
    Problem Symptoms:
    See APAR Error Description.
    

Problem conclusion

  • The root cause has been fixed with IBM Z Driver 41C Firmware
    patch P46598.557. This patch is part of IBM Z Firmware Bundle
    S92, which was made available on October 16, 2024. If you are
    among the affected customers described above, upgrade your IBM
    z15 firmware accordingly. With this firmware patch installed,
    the FIDs of RoCE cards no longer change after a POR on a machine
    operated using DPM.
    
    Please note the following:
    - After any POR of an IBM z15 machine (IBM model types 8561,
    8562) using DPM mode, the RoCE cards need to be activated
    manually (see the APAR Error Description above).
    - After the RoCE cards have been activated, the LPARs belonging
    to the multi-node cluster need to be stopped and started so that
    they pick up the system information concerning the status of the
    RoCE cards.
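
The manual steps after a POR can be summarized as a small checklist generator. This is only a sketch of the statements in this APAR for an IBM z15 in DPM mode: post_por_steps is an illustrative name, and the position of the json update relative to the LPAR restart is an assumption, since the APAR does not state an order for those two steps.

```python
def post_por_steps(fid_patch_installed):
    """List the manual steps after a POR of an IBM z15 in DPM mode.

    fid_patch_installed: True if MCL patch P46598.557 (bundle S92)
    is installed, in which case the FIDs do not change after a POR.
    """
    # Issue 1 is an architectural limitation, so this step always applies.
    steps = ["Activate the RoCE cards manually"]
    if not fid_patch_installed:
        # Issue 2: without the patch the FIDs may differ after the POR.
        # Placing this step before the LPAR restart is an assumption.
        steps.append("Update the json file with the new FIDs")
    steps.append("Stop and start the LPARs of the multi-node cluster")
    return steps


for number, step in enumerate(post_por_steps(fid_patch_installed=False), 1):
    print(f"{number}. {step}")
```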
    

Temporary fix

Comments

APAR Information

  • APAR number

    PH59572

  • Reported component name

    ANYTCS ACCLTR Z

  • Reported component ID

    5697DA700

  • Reported release

    750

  • Status

    CLOSED DOC

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2024-02-01

  • Closed date

    2024-06-20

  • Last modified date

    2025-02-05

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

  • Business unit: Systems - zSystems software (BU011)
  • Product code: SG19M
  • Platform: z Systems (PF054)
  • Version: 750

Document Information

Modified date:
05 February 2025