APAR status
Closed as documentation error.
Error description
After a Power On/Reset (POR) of an IBM z15 machine running in DPM mode, a multi-node accelerator cluster that uses RoCE cards does not start. The system details show the RoCE cards in status "Stopped" (issue 1). After manual activation of the cards, communication between the head node and the data nodes of the multi-node cluster still does not work. A second look at the system details reveals that the FIDs assigned to the RoCE cards are no longer the ones that were valid before the POR (issue 2). Once the json file was adapted to reflect the new FIDs, communication between the head node and the data nodes worked again and the accelerator became operational.
Background information:
On a z16, neither issue 1 nor issue 2 occurs.
On a z15 (and on a z14):
- Issue 1 happens due to an architectural limitation of the hardware / firmware.
- Issue 2 happens due to a bug in the firmware of the z15 (z14) machine.
Curing issue 2 on an IBM z15 system requires installation of an MCL patch (P46598.557) that is included in bundle S92.
Additional keywords: TS015305644 POR Z15 DPM ROCE FID BUNDLE S92 DT269759 DRIVER41C
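To spot issue 2 quickly after a POR, the FIDs recorded in the json file can be compared against the FIDs currently shown in the system details. The following is a minimal sketch only. The APAR does not describe the layout of the json file, so the keys `nodes`, `name` and `roce_fid`, the function name `find_stale_fids`, and the sample FID values are all hypothetical and must be adapted to the actual file.

```python
import json


def find_stale_fids(config_text, current_fids):
    """Return {node_name: (configured_fid, current_fid)} for nodes whose FID differs.

    config_text  -- content of the json file (layout is an ASSUMPTION, see above)
    current_fids -- mapping of node name to the FID read from the system details
    """
    config = json.loads(config_text)
    stale = {}
    for node in config["nodes"]:  # hypothetical key
        name = node["name"]  # hypothetical key
        configured = node["roce_fid"].lower()  # hypothetical key
        current = current_fids.get(name)
        # A missing or different FID means the json file needs to be adapted.
        if current is None or current.lower() != configured:
            stale[name] = (configured, current)
    return stale


if __name__ == "__main__":
    # Sample data, invented for illustration only.
    sample = json.dumps({"nodes": [
        {"name": "head", "roce_fid": "0010"},
        {"name": "data1", "roce_fid": "0011"},
    ]})
    print(find_stale_fids(sample, {"head": "0010", "data1": "0021"}))
```

An empty result means the configured FIDs still match; any entry returned names a node whose FID in the json file has to be updated before the cluster can communicate again.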
Local fix
Problem summary
Problem Summary: After a Power On/Reset of an IBM z15 machine (IBM model types 8561, 8562) running in DPM mode, a multi-node accelerator cluster that uses RoCE cards does not start.
Users Affected: Customers to whom all of the following apply:
- They run a multi-node accelerator.
- They use RoCE cards for the internode communication.
- They have deployed the accelerator on IBM z15 hardware (IBM model types 8561, 8562).
- They use IBM Z Dynamic Partition Manager (DPM).
Problem Scenario: See APAR Error Description.
Problem Symptoms: See APAR Error Description.
Problem conclusion
The root cause of the changing FIDs (issue 2) has been fixed with IBM Z Driver 41C Firmware patch P46598.557. This patch is part of IBM Z Firmware Bundle S92, which was made available on October 16, 2024. If you are one of the affected customers described above, upgrade your IBM z15 firmware accordingly. With this firmware patch installed, the FIDs of RoCE cards no longer change after a POR on a machine operated in DPM mode.
Please be aware of the following important statements:
- After any POR of an IBM z15 machine (IBM model types 8561, 8562) running in DPM mode, the RoCE cards need to be activated manually (see the APAR Error Description above).
- After the RoCE cards have been activated, the LPARs belonging to the multi-node cluster need to be stopped and started so that they pick up the current status of the RoCE cards.
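The statements in this APAR can be summarized as a small decision helper that lists the manual steps to expect after a POR. This is an illustrative sketch, not an IBM tool: the function name `post_por_actions` and the step wording are hypothetical, and the position of the FID update between card activation and LPAR restart is an assumption, since the APAR does not state an order for that step.

```python
FID_PATCH = "P46598.557"  # MCL patch named in this APAR, part of bundle S92


def post_por_actions(machine, installed_mcls=()):
    """List the manual steps expected after a POR in DPM mode.

    machine        -- "z14", "z15" or "z16"
    installed_mcls -- MCL patch identifiers installed on the machine
    """
    machine = machine.lower()
    if machine == "z16":
        # Neither issue 1 nor issue 2 occurs on a z16.
        return []
    if machine not in ("z14", "z15"):
        raise ValueError("machine generation not covered by this APAR: " + machine)

    # Issue 1 is an architectural limitation, so these steps always apply.
    actions = [
        "activate the RoCE cards manually",
        "stop and start the LPARs of the multi-node cluster",
    ]
    # Issue 2 is cured only on a z15 with the patch installed;
    # the APAR names no corresponding fix for the z14.
    fid_fixed = machine == "z15" and FID_PATCH in installed_mcls
    if not fid_fixed:
        # ASSUMPTION: the FID update is placed before the LPAR restart.
        actions.insert(1, "check the FIDs and update the json file if they changed")
    return actions


if __name__ == "__main__":
    for step in post_por_actions("z15", installed_mcls=[FID_PATCH]):
        print(step)
```

Even with the patch installed, a z15 still returns the activation and LPAR restart steps, which mirrors the important statements above.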
Temporary fix
Comments
APAR Information
APAR number
PH59572
Reported component name
ANYTCS ACCLTR Z
Reported component ID
5697DA700
Reported release
750
Status
CLOSED DOC
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2024-02-01
Closed date
2024-06-20
Last modified date
2025-02-05
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Applicable component levels
Business Unit: Systems - zSystems software (BU011)
Product: SG19M
Platform: z Systems (PF054)
Version: 750
Document Information
Modified date:
05 February 2025